The Charles Sumner School



More than any other school founded after the Civil War, the Charles 
Sumner School served as the cornerstone for the development of educa- 
tional opportunities for black citizens in the District of Columbia. The sig- 
nificance assigned to its design and construction was indicated by the selec- 
tion of Adolph Cluss as architect for the new building. In 1869, Cluss had 
completed the Benjamin Franklin School; in 1872, he completed Sumner 
School; and in 1873, he won a medal for “Progress in Education and School 
Architecture” for the City of Washington at the International Exposition in 
Vienna, Austria. 

Dedicated on September 2, 1872, the new school was named in honor 
of United States Senator Charles Sumner of Massachusetts, who ranked 
alongside Abraham Lincoln and Thaddeus Stevens in leading the struggle 
for abolition, integration, and nondiscrimination. Upon opening, the 
Sumner building housed eight primary and grammar schools, as well as the 
executive offices of the Superintendent and Board of Trustees of the 
Colored Schools of Washington and Georgetown. The building also housed 
a secondary school, with the first high school graduation for black students 
held in 1877. The school also offered health clinics and adult education 
night classes. 

A recipient of major national and local awards for excellence in restora- 
tion, Sumner School currently houses a museum, an archival library, and 
other cultural programs that focus on the history of public education in the 
District of Columbia.

Dedicated to David W. Stevenson (1951-1998)

Senior Advisor to the Acting Deputy Secretary of Education, 1993-98 



This book is dedicated to the memory of David W. Stevenson. His under- 
standing of the interplay between basic research and education policy facilitated 
the development of this research seminar. From his early days in the sociology 
program at Yale, David began to develop a discipline-specific understanding 
of the structural factors mediating social change. As he became more involved 
in controversial policy issues, he saw the necessity for more definitive empiri- 
cal evidence in their resolution. In the continual efforts of the research and 
policy communities, David’s perspective will continue to enrich conversations 
about the direction of and appropriate methodologies for education reform. 
We acknowledge, with this dedication, his memorable accomplishments and 
our appreciation for his influence on this research seminar.

Foreword



Peggy G. Carr 

Associate Commissioner 

Assessment Division 

National Center for Education Statistics 

In November 1998, a group of outstanding researchers and scholars gath- 
ered at the Charles Sumner School in Washington, DC to explore 
methodological issues related to the measurement of student achievement. 
Within this broad topic, the research seminar also focused more specifically 
on the sharing of perspectives related to the black-white test score gap. This 
sharing enabled the participants to compare their analyses and findings and to 
recommend improvements in data collection and analysis to the National Cen- 
ter for Education Statistics (NCES). Thus, eventually this collegial exchange 
promises to improve the utility of NCES data sets for policymakers in their 
efforts to ensure both excellence and equity in American education. 

Seeking deeper explanations of the test score gap is a critical first step in 
the process of assessing student achievement more accurately. Toward that end, 
the seminar demonstrated the need for NCES to pursue more aggressively the 
development of concepts and methodologies that allow independent analysts to 
unravel the causes of such gaps. Such an “unraveling” requires closer examina- 
tion of the complex interrelationships among resource factors, home and schooling 
influences, family configurations, and achievement outcomes. Further, NCES 
needs to place both cross-sectional and longitudinal data in a broader framework 
and to explicate our findings within diverse social contexts in richer detail. 

The work of the Assessment Division in NCES, in particular, will benefit 
from the development of more explicit constructs that allow better compari- 
sons of achievement results without the confounding interpretations that 
typically characterize conventional statistical presentations. For example, when 
achievement discrepancies between blacks and whites reveal different patterns 
in the northern states as compared to southern states, what type of analysis can 
we conduct that would enlighten our understanding of these historical and

This first seminar has reminded us of the value of having researchers,
scholars, and practitioners come together to advance knowledge in the field of 
achievement research and assessment. The collaboration of the sponsoring agen- 
cies — NCES, RAND, and the Office of Educational Research and Improvement 
(OERI) and its Achievement Institute — with their different missions, exempli- 
fies the desire to integrate discipline-based perspectives toward common 
education reform goals. OERI and NCES acknowledge ongoing opportunities 
to sponsor a series of research seminars in order to ensure continued progress 
toward improving education policies and practices on behalf of our children 
and youth. 

Seeking to engage a broader audience in this collegial exchange, NCES 
has prepared this volume containing the papers originally presented at the 
Charles Sumner School. The exchange of ideas among researchers and 
policymakers remains important to NCES. Still, this publication does not nec- 
essarily reflect the views of NCES or the policies of the U.S. Department of 
Education. Rather, the papers included here represent the views of their
respective authors alone.

The idea of a “research seminar” where academic researchers could share
their current research findings with their federal counterparts took shape ini- 
tially in early 1997. Ongoing discussions about the potential benefits of 
collaboration among the National Center for Education Statistics (NCES), 
RAND, and the National Institute on Student Achievement, Curriculum, and 
Assessment (NISACA) gave rise over the next year to a conceptual structure. 
A number of common interests were identified in the research and policy com- 
munities: periodic updates on complex survey designs and multilevel types of 
analysis. We went on to consider also our broader purposes: providing the 
direction to research that will inform policy developments in education, gener- 
ating wider awareness of education research, and stimulating the development 
of better educational theory. 

Within NCES, new forms of collaborative exchange were discussed. The 
one-day seminar received early support from Gary Phillips and Peggy Carr of 
the Assessment Division. Sharif Shakarani, then of the Assessment Division, 
helped to focus seminar offerings on NCES issues in data collection and analysis 
and fostered further collaboration by endorsing the participation of the differ- 
ent divisions in such a conference. Their understanding of the relevance of 
research updates shaped the concepts under discussion toward NCES needs. 
We are grateful, too, for Peggy’s strong and continuous advocacy and her fi- 
nancial support for the seminar. We appreciate also the substantive support 
offered by Holly Spurlock of the Assessment Division, whose careful and com- 
petent assistance throughout the process proved invaluable to the eventual 
success of the seminar. During this time, Daniel Kasprzyk, Director of the 
Schools and Staffing Program of NCES, also provided critical financial and 
moral support, and we remain grateful for his early commitment. The emerg- 
ing plans for the seminar received support from Pascal D. Forgione, then 
Commissioner of NCES, whose sentiments were always directed toward providing
the best research possible in the interests of assisting policymakers to
improve education. 

As the planning progressed, Joseph Conaty of NISACA provided ongo- 
ing insights in organizational support through his contacts with the academic 
research community. For the critical collaborations he contributed to this en- 
deavor, we express our continuing appreciation. Further, we acknowledge the 
contributions of Marian Robinson, then an intern in Joseph’s office, now at the 
Graduate School of Education, Harvard University, who smoothly executed 
numerous details of planning for the seminar. 

We would also like to thank Marilyn McMillen, Chief Statistician for 
NCES, for broadening the base of participation in the seminar through the 
provision of special funding to cover the travel expenses of graduate students. 

The role of the Education Statistics Services Institute (ESSI) in further- 
ing the broad research and development purposes of the seminar is also very 
much appreciated. ESSI’s ability to facilitate “making the seminar happen” 
made it possible for us to extend collaboration and consultation among all the 
participating groups. Another colleague whose support was critically impor- 
tant in the early stages of development is John Mullens, now of Mathematica 
Policy Research, who worked on the project under the auspices of ESSI. John 
offered substantive contributions to discussions about the importance and struc- 
ture of the seminar, and then cheerfully took the lead in facilitating arrangements 
among all the parties. Later, he played an important role in ensuring that the 
early drafts of the solicited papers arrived in time for review before they were 
distributed to seminar participants. The benefits of the seminar were enhanced 
by John’s grasp of the issues in research and policy and his facilitative skill. 

Our appreciation for managing critical details extends to Bridget Brad- 
ley, then a consultant with Policy Studies Associates and later Policy Analyst 
in the Office of the Deputy Secretary of Education, who offered invaluable 
logistical support to our efforts to plan the seminar. Her gracious manner comple- 
mented her careful attention to making and monitoring arrangements, and we 
thank her sincerely for her efforts. 



We extend very special thanks to the organizing committee that had ma- 
jor responsibilities for planning and staging the seminar, as follows: Peggy 
Carr, Holly Spurlock, and Daniel Kasprzyk, as well as John Mullens. We very
much appreciate the committee’s efforts, through the seemingly endless meetings,
messages, and phone calls. Further, this committee, along with Brenda
Turnbull of Policy Studies Associates and Martin Orland of the Early Child- 
hood and International Crosscutting Studies Division (ECICSD), also 
participated in the detailed planning for the publication of the proceedings, 
and we are indebted to them for their useful suggestions regarding major deci- 
sions about this book. The benefits of their efforts on behalf of the seminar 
should be seen for years to come, as NCES endeavors to ensure continuous 
improvements in data quality and analytical methods. 

On November 9, 1998 at the Charles Sumner School in Washington, DC, 
the seminar took place with approximately 100 participants in attendance. Titled 
“Analytic Issues in the Assessment of Student Achievement,” the research semi- 
nar was jointly sponsored by NCES; the National Institute on Student 
Achievement, Curriculum, and Assessment; and RAND, as we had planned for 
so many months. The beautiful setting, the quality of the papers and the com- 
mentary, and the collaborative and collegial nature of the day’s deliberations 
were the fruition of the long process of preparation. 

With appreciation, we acknowledge the “silent” reviewers of the early 
drafts of the solicited research papers. Their early reviews increased the use- 
fulness and applicability of the presentations and papers. These reviewers, in 
addition to the editors, were Martin Orland, John Ralph, Dan Kasprzyk, Peggy 
Carr, Joseph Conaty, and Holly Spurlock. Their work, though behind the scenes, 
was an important contribution to the substance of the seminar, and we appreci- 
ate their assiduous reviews. 



Subsequently, the papers were forwarded to the colleagues who had agreed 
to serve as discussants for the seminar. Sylvia Johnson (Professor of Education 
at Howard University), Robert M. Hauser (Professor of Sociology at the Uni- 
versity of Wisconsin-Madison), and Valerie E. Lee (Professor of Education at 
the University of Michigan) undertook the task of reviewing each pair of solic- 
ited research papers representing the methodological and conceptual strands 
of the seminar, seen here in Sections I, II, and III. Their comments enabled the 
authors of the solicited papers to make further improvements in their works 
before the seminar; then the discussants prepared their public responses for the 
presentations made during the seminar. We remain grateful for their dedication 
to this time-consuming task that benefited all seminar participants. 






Similarly, we offer our appreciation to Marshall S. Smith and Christopher
Jencks, whose presentations lifted our attention from such narrow topics
as sampling design and dataset linkages to take a broader look at the effects of
past analytical methods upon social scientists’ understanding of achievement 
disparities and to share insights into how those understandings have played a 
role in the development of new education policies. Smith and Jencks, each in 
his own way and from his own perspective, explained the vagaries of educa- 
tion research since “the Coleman report” and went on to describe the usefulness 
of better data collection and analysis and of better theories and models. 

Further, we acknowledge with appreciation the assistance of Joseph 
Conaty, John Ralph, and Martin Orland as moderators for the discussions dur- 
ing the seminar, as well as the participation of the seminar attendees (listed in 
the appendix), whose comments enriched the discussions and, therefore, the 
overall outcomes of the seminar. 

Following the event, we made the decision to edit the proceedings for 
publication, recognizing the far-reaching implications of the discussions for 
NCES and desiring to extend the insights to a broader audience. Even more 
ambitious were our later decisions to include the Introduction and the fourth 
section, Policy Perspectives and Concluding Commentary. It was fortunate that 
Anne Meek of ESSI was available for the tasks that these decisions required. 
As a professional editor working closely with us, Anne ensured both the comple- 
tion of the book and its internal coherence. We acknowledge with appreciation 
her grace and her sense of humor throughout the process of preparation. 

In the preparation of this book, special thanks are due to Ron Miller of 
RAND for the design of the cover of the book (which incorporates a photo- 
graph by David Grissmer). We also acknowledge the assistance of staff at ESSI 
who prepared the proceedings for publication, as follows: Allison Arnold, Mariel 
Escudero, Anne Kotchek, Qiwu Liu, Jennie Romolo, and Jennifer Thompson. 
We thank them for their attention to detail and their technical skills, which 
have greatly improved this book for use by researchers, policymakers, and 
educators. 

The persons named here have provided varied kinds and levels of support 
for the seminar and for the production of this book, and we are pleased to 
acknowledge our debt to each of them. However, the final responsibility for 
this publication rests with us, and any remaining deficiencies are solely our 
responsibility.

Introduction

Toward Heuristic Models of Student 
Outcomes and More Effective 
Policy Interventions 

C. Kent McGuire 
Assistant Secretary 

Office of Educational Research and Improvement 



In November 1998, in the research seminar commemorated here in this 
volume, a diverse community of scholars and researchers paused amidst their 
heavy schedules to turn their attention to a questioning of their methods of 
conducting empirical inquiries. Taking stock of a body of work is, of course, 
commendable for a professional group. It is always instructive to learn from 
one another and to consider how to better our efforts; and this seminar pro- 
vided ample opportunity for such learning and consideration along several 
dimensions. 

The seminar, however, went beyond the technical matters that education
researchers typically discuss on such occasions. The gathering also shed
light on research and policy issues, especially the continuing efforts to
improve the performance of American education, to advance equality of
educational opportunity, and to understand the sources of continuing
race-ethnicity achievement discrepancies. These larger purposes are, after
all, the reasons we collect and analyze data in the first place and the reasons we 
search for improvement in our methods of data collection and analysis. 

That the deliberations took place at the Charles Sumner School was es- 
pecially appropriate for the Office of Educational Research and Improvement 
(OERI). Sumner School, now restored and an architectural treasure of great 
beauty, has long served as an important symbol of minority education. In this 
setting, we were surrounded by a particularly fitting sense of history for this 
discussion of both the means for measuring student achievement and the
reasons for doing so.

The deliberations were enriched by multiple disciplinary perspectives. 
The research seminar included sociologists, economists, and education research- 
ers, both new and more established researchers, and federal policymakers, all 
of whom shared their insights with each other. That is, researchers from differ- 
ent disciplines and methodological backgrounds commented on each other’s 
analyses and listened to each other’s recommendations, and federal 
policymakers provided their perspectives on the role of research and the im- 
portant questions that must be addressed. In short, the seminar provided an 
enlightening forum for the exchange of perspectives and research findings, as 
participants contributed their particular expertise to discussions about the mea- 
surement of achievement and the contribution of education research to the 
improvement of schooling. 

Of particular importance are some new insights in the understanding of 
racial and ethnic differences in student achievement. Such differences were 
first brought to our attention nearly 30 years ago by “the Coleman report,” 
when the nation began to move equality of educational opportunity to its en- 
during place on the nation’s agenda. Since then, we have come to understand 
much more about the variables associated with both high and low achievement,
not nearly as much as we would like to know but certainly more than we
once knew. And OERI has always hoped to play a pivotal role in the empirical 
examination of these questions. 

Over the past 10 to 20 years, the federal government has been improving 
its data collections, and a wide array of analyses continue to be conducted to 
move our understanding beyond Coleman’s findings. These continued adjust- 
ments and processes have helped us to understand the complexity of what we 
are trying to measure and what we are trying to change. A brief synthesis of the 
papers solicited for this seminar will serve to illustrate the details of different 
data sets and, at the same time, help us to understand the systemic obstacles to 
changes in educational policies. 

The papers are organized under three major divisions: (1) Using Experi- 
ments and State-level Data to Assess Student Achievement, (2) Using 
Longitudinal Data to Assess Student Achievement, and (3) Relating Family 
and Schooling Characteristics to Academic Achievement. The last major division,
(4) Policy Perspectives and Concluding Commentary, presents important
observations about research methodology and funding and the connections
between research and policy, both with a retrospective view and a view toward
the future. 



Using Experiments and State-level Data to Assess
Student Achievement

In the first essay, Stephen Raudenbush characterizes the state proficiency
means from the Trial State Assessment of the National Assessment of Educa- 
tional Progress (NAEP) as “difficult to interpret and misleading.” It is their 
multidimensionality that makes proficiency scores difficult to interpret: they 
may look simple at first glance, but actually they reflect many factors — stu- 
dent demographics, school organization and processes, and state policy 
influences. Raudenbush discusses his multilevel analyses that compare states 
on their provision of student resources for learning. Not surprisingly, he finds 
that socially disadvantaged students and ethnic minority students (particularly 
African American, Hispanic American, and Native American) are significantly 
less likely than other students to have access to advanced course-taking oppor- 
tunities, favorable school climates, highly educated teachers, and cognitively 
stimulating classrooms. He also finds substantial variation across states in the 
extent of inequality in access to such resources. Such findings point, as he
says, toward “sharply defined policy debates concerning ways to improve
education.”

Grissmer and Flanagan speak from a different but equally illuminating 
perspective. Their major focus, fueled by concerns about inconsistency in re- 
search results, is the lack of consensus across the broad and multidisciplinary 
research communities in educational research. In many respects, of course, 
this lack of consensus has been inevitable, given the different research per- 
spectives; the varied points of view expressed by researchers, policymakers, 
and practitioners; and the inherent complexity of education. Grissmer and 
Flanagan believe, therefore, that improvements in data collection and statisti- 
cal methodologies, by themselves, are not sufficient to bring about the kind of 
consensus needed to effectively guide educational policies. Thirty years of re- 
search with nonexperimental data have led to almost no consensus on important 
policy issues, such as the effects of educational resources and educational poli- 
cies on children and the impact of resources on educational outcomes. Further, 
they propose to guide the process of creating consensus through the development
of a strategic plan, which would enable experimentation and data collection
to provide the quality of data necessary for theory-building and also improve
the specifications of models used in nonexperimental analysis. 

Grissmer and Flanagan therefore recommend three approaches likely to 
lead to consensus: increasing experimentation, building theories of educational 
process, and improving nonexperimental analysis. They suggest that experi- 
ments have two main purposes: they provide the closest-to-causal explanations 
possible in the social sciences, and they help to validate model specifications 
for nonexperimental data. They present detailed discussions of important policy 
issues and the findings of research, including critical analyses of the “money 
doesn’t matter” issue and the issue of the effects of resources on achievement, 
with examples from the many ways researchers have addressed these ques- 
tions over the years. They also provide insight into such efforts as the Tennessee 
class size experiment, the use of NAEP scores and SAT scores, and new meth- 
ods of analyzing education expenditures. 

In addition to making some methodological recommendations, Grissmer 
and Flanagan explain the process of theory-building cogently and clearly. To 
advance theory-building, they advocate linking the disparate and isolated fields 
of research in education, for example, linking the micro-research on time, rep- 
etition, and review with the research on specific instructional techniques, 
homework, tutoring, class size, and teacher characteristics. Further, to enhance 
the development of modeling assumptions, they recommend linking the re- 
search on physical, emotional, and social development, differences in children, 
delays in development, and resiliency factors. Their suggestions for improve- 
ments encompass the need for experiments, improvements in NAEP data such 
as collecting additional variables from children, and supplemental data from 
teachers, among other things. All in all, their paper offers timely and thought- 
provoking views about the research community’s next steps in improving 
theories of education and models of research, so that eventually the nation can 
indeed achieve its desired goals in education. 

Using Longitudinal Data to Assess 
Student Achievement 

Next, Meredith Phillips offers a number of convincing and far-reaching 
observations about improving methods of data collection and analysis, espe- 
cially in efforts to understand ethnic differences in academic performance. 
Perhaps most relevant is her observation, echoed by other presenters, that we
must study ethnic differences explicitly despite their political sensitivity. She
explains that socioeconomic factors do not overlap with ethnicity as much as 
researchers have traditionally assumed. Ethnic differences in learning vary
between the school year and the summer; therefore, collecting data in both
the spring and the fall of each school year should be a priority for
empirical inquiry. Further, since the test score gap widens more during
elementary school than during high school, and children’s test scores appear less
stable during elementary school than during high school, Phillips also calls for 
focusing more surveys on elementary students rather than on high school stu- 
dents. Of particular interest is her assertion that we have learned little about 
ethnic differences because researchers have not adequately studied education 
outside of the formal institution of schooling. Measuring the cognitive skills of 
infants and toddlers prior to their entry into school could help to clarify ethnic 
differences in family influences on achievement. Phillips concludes by remind- 
ing us that “it is not logically necessary to understand the causes of a social 
problem before intervening successfully to fix it.” To those who bear responsi- 
bility for the improvement of American education, this reminder is somewhat 
comforting, in view of the breadth and depth of recommendations made by 
this network of researchers and scholars. 



Ferguson and Brown then discuss the relationship of teacher quality to 
student achievement, in particular, the relationship of teachers’ certification 
test scores to students’ test scores. The evidence they have assembled suggests 
that the black-white test score gap among students reflects a similar test score 
gap among teachers. From several studies, they cite findings suggesting that 
“teachers’ test scores do help in predicting their students’ achievement.” For 
example, scores on the Texas Examination of Current Administrators and Teach- 
ers (TECAT) turned out to be strong predictors of higher student reading and 
math scores in school districts across the state. Ferguson and Brown explicitly 
make the point that ensuring well-qualified teachers in districts where minor- 
ity students are heavily represented is “part of the unfinished business of 
equalizing educational opportunity.” In Alabama, certification testing reduced 
entry into teaching by candidates with weak basic skills and consequently nar- 
rowed the skills gap between new black and white teachers. Since the rejected 
candidates would probably have taught disproportionately in black districts, 
Ferguson and Brown suggest that the policy of initial certification testing is 
probably helping to narrow the test score gap between black and white
students in Alabama. Predictive validity has not yet been used as a criterion for
validating such exams; still, Ferguson and Brown contend that policymakers
can safely assume a positive causal relationship between students’ and
teachers’ scores. 

Relating Family and Schooling Characteristics to 
Academic Achievement 

Brewer and Goldhaber offer additional insights into the relationship of stu- 
dent achievement and teacher qualifications, based on their analyses of data from 
the National Education Longitudinal Study of 1988 (NELS:88). Their linking of 
student-teacher-class elements in NELS:88 permitted these researchers to
investigate the effects of specific class size, teacher characteristics, and peers on
student achievement, through the use of multivariate statistical models. The
NELS:88 data enabled the researchers to link students to their particular teachers 
and specific courses. In their analyses, they find that subject-specific teacher 
background in math and science is positively related to student achievement in 
those subjects, as compared to teachers with no advanced degrees or with de- 
grees in non-math subjects. They did not see this pattern repeated in English and 
history. Nor did they find positive effects on achievement associated with teacher 
certification or years of teaching experience. 

While encouraged by the recent improvements in data collection exem- 
plified by NELS:88, Brewer and Goldhaber make pertinent recommendations 
for future data collections. Seeing the link between students and teachers as 
critical, they strongly recommend that such links not only be maintained, but 
also strengthened by the collection of additional data about teachers’ back- 
grounds. Specifically, they suggest the addition of teacher test scores, the years 
that teachers obtained their licenses, and the states where they were licensed. 
Such data would be quite useful now and in the future, since policymakers in 
many states have recently overhauled or are considering changing licensure 
and/or teacher preparation requirements. 

Brewer and Goldhaber point out that items relating to student, parent, and 
teacher beliefs, attitudes, and feelings could be omitted from data collections, 
since policymakers can only indirectly affect these. Further, they raise the ques- 
tions of de-emphasizing the collection of nationally representative samples or of 
sampling fewer schools with more data on students and classes in a smaller num- 
ber of schools. Brewer and Goldhaber are seeking the data quality necessary for 
the use of multivariate statistical models, because researchers find such models 
persuasive in tackling important policy questions. Brewer and Goldhaber
clearly state their belief that the “ultimate reason to collect data is to influence
public policy in a positive way,” a perspective that supports the continued im- 
provement of data collection and methods of analysis. 

Finally, in their investigation into school-level correlates of student 
achievement, McLaughlin and Drori report linking three sources of data: (1) 
data from the Schools and Staffing Survey (SASS) regarding such school and 
background factors as school size, class size, normative cohesion, teacher in- 
fluence, student behavioral climate, teacher qualifications, and the like; (2) 
student achievement data from statewide assessments; and (3) data from the 
1994 State NAEP fourth grade reading assessment in public schools. These 
researchers constructed a set of 18 composites of data on student background, 
organizational aspects, teachers’ qualification, and school climate perceptions, 
then merged them with school reading and mathematics mean scores. 
McLaughlin and Drori analyzed the relationships of various school organiza- 
tional factors to student achievement, hoping to elicit evidence on the 
correlations between school reform policies and achievement. An important 
finding is that reading scores were higher in schools with smaller class sizes. 
This finding was consistent across grade levels. Another interesting finding is 
that middle and secondary schools in which teachers perceive that they have 
more than average control over classroom practices and influence on school 
policies tend to be schools in which mathematics scores are higher. 

Perhaps more exciting than their findings, however, is the methodology 
McLaughlin and Drori employed and its potential for identifying effective school 
policies. Teasing out the correlates of student achievement through such linkages
of databases is a promising avenue for researchers and policymakers alike,
especially since a number of states are turning to reforms that establish conse- 
quences for schools based on their gains in achievement over years. 

Policy Perspectives and Concluding Commentary 

Midway through the seminar, Marshall S. Smith engaged seminar par- 
ticipants in a retrospective look at past policy efforts to monitor and mitigate 
the discrepancies in black-white achievement scores. In his paper, he discusses
possible explanations for the status of the gap at various points in time and 
concludes by reviewing current policy directions that promise further improve- 
ments in student achievement and recommending increased attention to 
experimental field trials.

Smith describes the reductions in the black-white achievement gap from
1971 through 1988, as seen in data from NAEP assessments, referring to a 
paper that he and Jennifer O’Day published in 1991, which reviewed policy 
initiatives and changes in student achievement 25 years after the Coleman re- 
port. Smith, who was at that time dean of the graduate school of education at 
Stanford University, pointed out in his presentation that these reductions re- 
flected consistent and substantial increases in black scores and almost no change 
in white scores. In less than 20 years, the reduction in the achievement gap 
between black and white students was 33-50 percent in reading and 25-40 
percent in mathematics, according to NAEP data. 

Smith summarizes several tentative explanations for the reduction in the
gap between 1971 and 1988, explanations that he and O’Day had first
discussed in their paper. They had recognized, first, the large decrease in the
percentage of black children living in poverty: from 65 percent in 1960 to 42 
percent in 1980. Another highly plausible explanation was that preschool at- 
tendance increased substantially for low-income children. Further, Smith notes, 
the educational quality of schools for black students was dramatically enhanced 
with the dismantling of the old dual school system. In addition, the effects of 
Title I — while difficult to assess by numbers alone — included an increase in 
educational resources in schools, lower class sizes, and an emphasis on the 
basics of reading and mathematics. And, as Smith reiterated during the semi- 
nar, Title I also served to focus national attention on the needs of low-income 
students, many of whom were African American. 

Smith reminds seminar participants that he considered the basic skills 
movement an influence in reducing the achievement gap at the secondary level 
during this period. After all, by the mid-1980s over 33 states had required 
students to pass a minimum competency test as a criterion for graduation. The 
resulting instructional emphasis on basic skills, combined with the “high stakes” 
tests, produced the focus and coherence in the curriculum needed for improv- 
ing student achievement. 

Smith goes on to speculate that, by 1990, the effects of the factors identi- 
fied by him and O’Day had begun to diminish in their influence and that, 
therefore, the gap between black and white students’ test scores was no longer 
continuing to narrow. Thus, the current task for policymakers has become to 
identify and implement policy ideas that promise to continue the process of
closing the gap initiated in prior decades. This task means thinking hard about,
and also building upon, the interventions that brought about the earlier
improvements in achievement.

Smith describes three major objectives at the federal level designed to 
support efforts to improve education in general and also to reduce the gap. 
The first is to create overall conditions as stable and livable as possible for 
all families with children. Smith cites, as efforts toward this objective, recent 
sustained economic growth and specific policies such as the Earned Income 
Tax Credit and the Children’s Health Insurance Plan. The second objective is 
to expand educationally rich opportunities for all students beyond typical 
school schedules. As specific examples Smith lists the development of edu- 
cation standards for the Head Start curriculum, the expansion of Head Start 
enrollment, and increased services through the 21st Century After-School
Program. The third is to encourage state and local standards-based reforms. 
Toward this end, federal programs such as Title I and Goals 2000 have been 
aligned to support the state reforms. 

Standards-based reform, considered one of Smith’s major contributions 
to education policy, in effect extends the basic skills movement to a much 
broader scope, with all children expected to attain the higher content and per- 
formance standards, not just basic skills. Even at such an early date as this, it is 
worth examining the promise of such reforms by looking at outcomes within 
the states. What have been the test score results in states with focused and 
coherent strategies in their standards-based reforms? Using NAEP data, Smith 
finds encouraging results in those states — especially North Carolina and Texas — 
with relatively challenging standards, curriculum-aligned tests, accountability 
provisions, extensive teacher training, and special efforts on behalf of low- 
scoring students. It is apparent that, for whatever reasons, some states are doing 
very well in their efforts to improve student outcomes, while others are not. 
Therefore, policymakers are obliged to consider very carefully the evidence 
about interventions that promise to lead to improved student performance. 

Moving to a prospective view, and building a case for increasing experi- 
mental efforts, Smith cites the strength and authority of such studies as the 
Tennessee class size study and those on early reading acquisition at the National 
Institute of Child Health and Human Development (NICHD). He identifies sev- 
eral areas where policy development could well be more adequately informed 
through such studies; for example, methods of incorporating technology into
classrooms, the effects of summer school, and replications of the NICHD studies. Smith
argues eloquently for increasing the use of experimental field trials in education
research and suggests that a list of recommendations for consideration for the re- 
search agenda at the Department of Education might come from the seminar. 

Indeed, as Christopher Jencks pointed out in his presentation, for those 
who believe that educational policy should be based upon a more solid eviden- 
tiary structure, the current shortage of any type of randomized field trials in 
education policy represents perhaps the greatest challenge facing education 
policymakers and researchers alike. More pointedly, of course, OERI faces 
this challenge in designing a course for its own research agenda. According to 
Jencks, a major advantage of experimental studies is that the more persistent 
and difficult policy questions can be answered more definitively by the inclusion
of randomization procedures at the school and classroom levels. These
questions cannot be answered by improved data collections, more complex 
surveys, or more refined statistical methods alone. Critical policy questions 
such as the debate over ability grouping can be intensely controversial; and to 
resolve such questions by randomized field trials would still entail some un- 
avoidable political fallout, no matter how definitive the findings. 
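
The logic behind this argument for randomization can be illustrated with a small simulation. The sketch below is not drawn from the seminar papers: the function names, the sample size, the assumed true effect of 0.2, and the rule by which advantaged units select into the intervention are all hypothetical choices made for illustration. It contrasts a design in which advantaged units are more likely to receive an intervention with one in which a coin flip decides.

```python
import random
from statistics import fmean


def difference_in_means(treated, outcomes):
    """Mean outcome of treated units minus mean outcome of untreated units."""
    t = [y for flag, y in zip(treated, outcomes) if flag]
    c = [y for flag, y in zip(treated, outcomes) if not flag]
    return fmean(t) - fmean(c)


def compare_designs(n=20000, true_effect=0.2, seed=0):
    """Estimate the same (assumed) effect under two assignment rules."""
    rng = random.Random(seed)
    # Unobserved advantage that raises outcomes regardless of treatment.
    advantage = [rng.gauss(0, 1) for _ in range(n)]

    # Observational design: advantaged units are more likely to be treated.
    observed_t = [a + rng.gauss(0, 1) > 0 for a in advantage]
    # Randomized design: a coin flip decides, independent of advantage.
    randomized_t = [rng.random() < 0.5 for _ in range(n)]

    def outcomes(treated):
        return [true_effect * t + a + rng.gauss(0, 1)
                for t, a in zip(treated, advantage)]

    return {
        "true_effect": true_effect,
        "observational": difference_in_means(observed_t, outcomes(observed_t)),
        "randomized": difference_in_means(randomized_t, outcomes(randomized_t)),
    }


if __name__ == "__main__":
    for name, value in compare_designs().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions the observational comparison absorbs the unobserved advantage and overstates the effect, while the randomized comparison stays close to the assumed value. No amount of additional sample size changes that pattern, which is the point made above about data collection alone.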

Then, too, Jencks notes that the idea of randomized trials is rarely ac- 
cepted within the field of education research. There are a number of practical 
obstacles to utilizing experimental methods: they inevitably change established 
school routines, since they necessarily include randomization of students or 
teachers to different schools or classes. It might be possible to convince educa- 
tors that such procedures would constitute a small price to pay, given the very 
useful information to be gained, if only the researchers themselves strongly 
supported experimental studies. Jencks notes, however, that most education
researchers are typically unenthusiastic about randomized experiments. In fact,
he contends that most researchers now have limited knowledge of classic
experimental studies.

Still, Jencks insists that the advantages of randomized field trials to 
policymakers are large and attractive. The first advantage lies in the knowl- 
edge to be gained from wider use of experimental methods; the second, in the 
clarity of understanding that results from these intuitively obvious methods. A 
legislator or a school board member, for example, can follow the logic of the 
Tennessee class size experiment, understand how the results were evaluated, 
and see why the results are consistent with what researchers say they mean. 
Nevertheless, Jencks is not suggesting that we abandon descriptive types of
research proposals. On the contrary, surveys and experiments complement one
another, each yielding valuable results that supply the data necessary
for policymaking. But the present dearth of experiments sounds a warning
to OERI and highlights an imperative need for the next few years. 

Indeed, with such different perspectives and challenging viewpoints brought 
to bear on a single topic, many possible directions were identified for the future 
work of NCES and OERI. Throughout the seminar, presenters and participants 
were persuasive in their descriptions of the necessity of complementing longitu- 
dinal survey data with data collected in the classical research design tradition 
such as the Tennessee class size experiment. Their praise for renewed consider- 
ation of experiments made this issue the predominant theme of the seminar, and 
one with far-reaching implications for the sponsors of the event. 

Taking stock of our empirical methods — more or less the primary reason 
for organizing the seminar — yielded a second theme in the comments from 
presenters and participants. This theme was seen in the abundance of propos- 
als for improvements in the design and analysis of data collections, including 
ways of making longitudinal studies more elaborate; suggestions about the 
addition or deletion of certain types of items on surveys; sampling more stu- 
dents per teacher; collecting longitudinal data more frequently; and gathering 
more measures of teacher quality. Implicit in many of the recommendations is 
the idea of more critical evaluation of the utility of variables and methods in all 
NCES surveys, whether longitudinal or cross-sectional, in order to design bet- 
ter surveys in the future. These suggestions translate into serious considerations 
for OERI and NCES as they move forward with new assessments of student 
achievement, as well as with all other surveys and analyses. 

Last but by no means least, seminar participants emphasized the impor- 
tance of communication among the different research disciplines. They referred 
specifically to the power of experiments to communicate effectively with 
policymakers and other researchers. They expressed appreciation for the semi- 
nar as a good example of such communication and recommended more such 
opportunities. The value of the seminar can easily be seen in the broad, data- 
based dialogue among researchers about the choices facing NCES and OERI 
and presented in this book. Suggestions were made to open the door to new 
partnerships among federal, state, and private researchers and to establish con- 
nections between state-based researchers and federal researchers. Interestingly 
enough, repeated references to the benefits to be gained from openness to a
variety of audiences constituted a sub-theme of the seminar. Communication
is, after all, an essential component of building consensus among researchers, 
scholars, and policymakers. 

In short, the exchanges of this seminar promise researchers and 
policymakers alike that racial and ethnic differences in achievement can be 
explored more effectively than at present, that schools can continue to move 
toward equality of educational opportunity, and that progress toward the im- 
provement of American education requires our continued communication, 
collaboration, and commitment. It is now our task to translate our knowledge 
into improved policies and practices in education for the benefit of our children
and our nation.

During the past two decades, U.S. researchers, policymakers, and journalists
have expressed concern that the nation’s schools are failing to prepare
students to meet the demands of the modern global economy. Researchers have
interpreted international assessments as revealing serious weaknesses in math- 
ematics and science proficiency (see, for example, Beaton et al. 1996; Medrich 
and Griffith 1992; NCES 1995, 230-231). Although such claims can be strongly 
contested (cf. Rotberg 1998), they support a broader climate of malaise, and
even crisis, concerning the performance of U.S. schools. 

In this climate, calls for reform and accountability at every level of the 
education system have taken on greater urgency. The stakes are often high: 
students in Chicago must pass a citywide test to be promoted to the next grade; 
students in Michigan can obtain endorsed diplomas only by passing the state’s 
proficiency test; teachers with high-scoring classrooms can obtain cash re- 
wards in some districts; and school principals are held accountable for school 
mean achievement. 

For comparisons at the state level, the key source of data is the Trial State 
Assessment (TSA) of the National Assessment of Educational Progress (NAEP), 
“the Nation’s Report Card” (cf. Mullis et al. 1992). Administered every two
years (though in different subject areas at each administration), TSA enables 
cross-sectional comparisons among participating states in several subject ar- 
eas at several grades and allows estimation of trends in student mean proficiency 
over time. Participation has grown to include more than 40 states and U.S. 
territories. But what are we to make of such comparisons between states? 

Most “users” of the TSA would like to view state proficiency means as 
reflecting the effectiveness of educational provision, policy, and practice within 
each state. If so, TSA would provide direct evidence of the quality of each
state’s educational system. Talking to those involved in reform, for example, I 
have found it common to view California’s performance on TSA in certain 
subject areas as direct evidence of the failure of reform in that state. Yet even a 
cursory examination of TSA data reveals that state demographic composition, 
including poverty levels and ethnic composition, is strongly associated with 
state mean proficiency — and state trends in proficiency are undoubtedly asso- 
ciated with state trends in demography. Thus, critics claim that state means are 
surrogates of demography more than indicators of educational effects. This 
criticism has led to many calls for statistical adjustment of state means on the 
basis of student social and ethnic background. Indeed, it is possible to compare 
states within strata defined by ethnic background and parental education (as in 
Mullis et al. 1992), but such within-stratum comparisons control background
differences only roughly and do not take into account the extent to which 
a school’s demographic composition creates a context affecting student 
performance. 



The National Assessment Governing Board, which provides policy di- 
rection to NAEP, has resolutely rejected the notion of reporting statistically 
adjusted state mean proficiency. Board members fear that adjustments for stu- 
dent background will lower expectations for school systems serving 
disadvantaged students. There are also sound statistical reasons to be skeptical 
about adjustments. Suppose, for example, that we use a regression analysis to 
compute state mean residuals, that is, discrepancies between the actual state 
means and the means expected on the basis of student composition. Such re- 
siduals have often been interpreted as indicators of the “value added” by the 
schooling system. Yet, if the regression model fails to include key aspects of 
educational policy and practice, the estimates of the association between stu- 
dent composition and outcomes will be biased. The bias would arise because 
the quality of educational provision and student composition would be posi- 
tively correlated, with the most advantaged students tending to be found in the 
schools with the most favorable resources, policies, and practices. Failing, then, 
to control for the quality of educational provision will inflate estimates of the 
contribution of student demography. This inflation, in turn, will lead to biased 
“value added” indicators. The result is an over-adjustment for demography,
such that systems serving the most advantaged students will tend to look less
effective than they are. However, the magnitude of the over-adjustment is
impossible to assess in the absence of data on the quality of school policy and
practice (see Raudenbush and Willms [1995] for a thorough discussion of this
problem in the context of school evaluation).

Interpretation of state proficiency means is thus terribly risky. We cannot
equate unadjusted state mean proficiency with educational effectiveness as many 
reformers wish, yet adjusted means set up low expectations for states serving 
poor students and are statistically untrustworthy. 
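
The over-adjustment argument can be made concrete with a simulation. The following is a minimal sketch under stated assumptions, not a reanalysis of TSA data: the names `simulate_adjustment` and `ols_slope`, the unit coefficients, the correlation of 0.6 between composition and quality, and the large number of simulated systems are all hypothetical. Outcomes depend on both student composition and educational quality, but the regression used for adjustment omits quality.

```python
import random
from statistics import fmean


def ols_slope(x, y):
    """Slope from a one-predictor least-squares regression of y on x."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate_adjustment(n=20000, b_comp=1.0, b_qual=1.0, rho=0.6, seed=0):
    """Adjust outcomes for composition only, although quality also matters."""
    rng = random.Random(seed)
    comp = [rng.gauss(0, 1) for _ in range(n)]
    # Quality is positively correlated (rho) with composition, by assumption.
    qual = [rho * c + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for c in comp]
    score = [b_comp * c + b_qual * q + rng.gauss(0, 1)
             for c, q in zip(comp, qual)]

    # Regression that omits quality: the composition slope absorbs part of it.
    slope = ols_slope(comp, score)
    m_comp, m_score = fmean(comp), fmean(score)
    residual = [s - m_score - slope * (c - m_comp)
                for c, s in zip(comp, score)]
    true_effect = [b_qual * q for q in qual]

    # Systems serving the most advantaged students (composition above 1 SD).
    advantaged = [i for i, c in enumerate(comp) if c > 1.0]
    return {
        "true_slope": b_comp,
        "estimated_slope": slope,
        "advantaged_true_effect": fmean(true_effect[i] for i in advantaged),
        "advantaged_value_added": fmean(residual[i] for i in advantaged),
    }


if __name__ == "__main__":
    for name, value in simulate_adjustment().items():
        print(f"{name}: {value:.3f}")
```

In this setup the estimated composition slope exceeds the assumed value, and the residual-based “value added” for the most advantaged systems falls well below the quality effect built into the simulation. The size of the shortfall depends entirely on the assumed correlation, which is exactly the quantity that cannot be estimated without data on school policy and practice.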

The problem of interpreting the results of the TSA frames the pair of 
investigations I shall discuss in this paper.¹ The debate over the meaning of
state mean proficiency reflects a longstanding debate about the sources of in- 
equality in academic achievement in the United States. If inequality in family 
background is the key to inequality in educational outcomes, then inequality in 
aggregate family background ought to be key to understanding differences in 
state achievement means. On the other hand, if inequality in school quality is 
key to understanding inequality in individual outcomes, then aggregate school 
quality ought to explain state variation. Fortunately, NAEP provides some rea- 
sonable data at the level of both the student and the school to test these 
propositions. 

Our first investigation, then, tested models for student math proficiency 
within each of the participating states of TSA. This may be likened to a “meta- 
analysis” in which each state’s data provide an independent study of the 
correlates of math proficiency. We examined student social, ethnic, and lin- 
guistic backgrounds, and home educational resources as predictors of student 
proficiency. Yet our models simultaneously included indicators of educational 
quality: course-taking opportunities, school climate, teacher qualifications, and 
cognitive stimulation in the classroom. Our findings, reasonably consistent 
across states, supported both the “home effects” and the “schooling effects” 
explanations: the hypothesized explanatory variables related to student outcomes
as expected. This exercise may be criticized as merely recapitulating
decades of educational research, and not even with the best available data.² Yet
TSA does offer the opportunity to compare results across states, for it is the
only data set that contains a large, representative sample of students in each of
many states.

¹ The research reported here was funded by the National Assessment of Educational
Progress Data Reporting Program of the National Center for Education Statistics (NCES)
under a grant to Michigan State University. The views expressed herein do not represent
the position of NCES. This paper summarizes and discusses findings from two papers:
“Synthesizing Results from the Trial State Assessment,” to appear in the Journal of
Educational Statistics; and “Inequality of Access to Educational Opportunity: A National
Report Card for Eighth Grade Math,” Educational Evaluation and Policy Analysis, 20(4),
253-268. Authors of both papers are Raudenbush, S.W., Fotiu, R.P., and Cheong, Y.F. I
wish also to thank Marcy Wallace for administering the many tasks associated with the
analysis and Zora M. Ziazi for her work on data analysis.

Perhaps more importantly, the analyses within states bear directly on
controversies surrounding accountability at the state level. Our key finding 
was that, while states vary substantially in unadjusted proficiency means, once 
we control for NAEP indicators of student background and educational qual- 
ity, nearly all of the state variation vanishes. This makes sense, in that state-level 
policies (e.g., regulations, incentives, and aid) can presumably affect student 
outcomes only by affecting specific educational resources and practices at a 
more local level, i.e., within schools and classrooms. If those local resources 
and practices were fully controlled in our models, there would be no direct role 
for state policy to affect student achievement. 
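
The reasoning in this paragraph can be sketched with simulated data. The example below is an illustration under stated assumptions and does not reproduce the models fitted to TSA: the function names, the 40 hypothetical states, the coefficients 0.8 and 0.5, and the use of pooled within-state least squares are all choices made here for simplicity. States differ in average student background and average school resources, and by construction the state has no direct effect on scores.

```python
import random
from statistics import fmean, pvariance


def make_states(n_states=40, n_students=500, seed=1):
    """Simulate (background, resources, score) rows for each hypothetical state."""
    rng = random.Random(seed)
    states = []
    for _ in range(n_states):
        bg_mean = rng.gauss(0, 0.5)    # states differ in student background
        res_mean = rng.gauss(0, 0.5)   # states differ in school resources
        rows = []
        for _ in range(n_students):
            bg = bg_mean + rng.gauss(0, 1)
            res = res_mean + rng.gauss(0, 1)
            # Assumption: the state itself has no direct effect on the score.
            score = 0.8 * bg + 0.5 * res + rng.gauss(0, 1)
            rows.append((bg, res, score))
        states.append(rows)
    return states


def within_state_slopes(states):
    """Pooled within-state least squares with two predictors."""
    s11 = s22 = s12 = s1y = s2y = 0.0
    for rows in states:
        m1 = fmean(r[0] for r in rows)
        m2 = fmean(r[1] for r in rows)
        my = fmean(r[2] for r in rows)
        for x1, x2, y in rows:
            d1, d2, dy = x1 - m1, x2 - m2, y - my
            s11 += d1 * d1
            s22 += d2 * d2
            s12 += d1 * d2
            s1y += d1 * dy
            s2y += d2 * dy
    # Solve the 2x2 normal equations directly.
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


def state_variation(states):
    """Variance of state means before and after adjusting for both predictors."""
    b1, b2 = within_state_slopes(states)
    raw = [fmean(r[2] for r in rows) for rows in states]
    adjusted = [fmean(y - b1 * x1 - b2 * x2 for x1, x2, y in rows)
                for rows in states]
    return pvariance(raw), pvariance(adjusted)


if __name__ == "__main__":
    raw_var, adj_var = state_variation(make_states())
    print(f"unadjusted between-state variance: {raw_var:.4f}")
    print(f"adjusted between-state variance:   {adj_var:.4f}")
```

When state policy works only through measured background and resources, as assumed here, the adjusted between-state variance shrinks to little more than sampling noise. A large remaining variance would instead point to omitted local resources or practices.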

Yet once we verify that state differences almost entirely reflect variation 
in measurable aspects of student background and school quality, our focus 
logically shifts to these “correlates of proficiency.” In particular, state differ- 
ences in correlates of proficiency that can be manipulated by policy become 
especially salient. This led to our second investigation: a study of state-to-state 
variation in the provision of key educational resources, in particular those re- 
sources found consistently related to student outcomes across states. 

We were especially interested in equality of access to those resources as 
a function of student social and ethnic background. Our logic was as follows: 
having found what many prior studies have found, i.e., that socially disadvan- 
taged and ethnic minority students are at high risk of poor performance, we are 
inclined to ask about the extent to which these students have access to key 
resources for learning. 

² The cross-sectional data of the TSA do not enable the degree of control for prior student
achievement that is possible in a longitudinal study such as NELS. Moreover, NAEP
indicators of educational policy and practice are not nearly as refined as are those in

Our results were again not surprising, but nonetheless disconcerting: socially
disadvantaged students and ethnic minority students (particularly African
American, Hispanic American, and Native American students) are significantly
less likely than other students to have access to favorable course-taking
opportunities, school climates, qualified teachers, and cognitively stimulating
classrooms. However, what is new and perhaps unique is a second finding based on
TSA: the degree of social and ethnic inequality of access to resources varies 
substantially by state. This finding led us to propose a novel “report card” for 
states based not on mean outcomes, but rather on the extent to which the schools 
in a state provide key resources for learning. Moreover, our report card allows 
examination not only of state differences in overall access to these resources, 
but also state differences in the extent to which access is equitable as a func- 
tion of social background and ethnicity. 
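
A report card of this kind can be sketched in a few lines. The example below is a simplified, descriptive stand-in and should not be read as the method used in the papers summarized here: the function name `access_report_card`, the record layout, the group labels, and the toy records are hypothetical. For each state it reports the overall share of students with access to a resource and the gap between the best-served and worst-served groups.

```python
from collections import defaultdict


def access_report_card(records):
    """Summarize access by state.

    records: iterable of (state, group, has_access) tuples, where
    has_access is True if the student has access to the resource.
    """
    # state -> group -> [students with access, total students]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for state, group, has_access in records:
        cell = counts[state][group]
        cell[0] += int(has_access)
        cell[1] += 1

    card = {}
    for state, groups in counts.items():
        rates = {g: a / n for g, (a, n) in groups.items()}
        total_access = sum(a for a, _ in groups.values())
        total_n = sum(n for _, n in groups.values())
        card[state] = {
            "overall": total_access / total_n,                 # level of access
            "gap": max(rates.values()) - min(rates.values()),  # inequality of access
            "rates": rates,
        }
    return card


if __name__ == "__main__":
    toy = [
        ("State A", "low SES", True), ("State A", "low SES", False),
        ("State A", "high SES", True), ("State A", "high SES", True),
        ("State B", "low SES", True), ("State B", "low SES", True),
        ("State B", "high SES", True), ("State B", "high SES", True),
    ]
    for state, row in access_report_card(toy).items():
        print(state, row)
```

Reporting the level and the gap separately keeps the two questions apart: a state can provide a resource widely yet unequally, or narrowly yet evenly. A fuller treatment would also account for sampling error and for several background characteristics at once.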

These analyses, while fruitful in our view, also reveal important limita- 
tions in data provided by the TSA. These limitations are not so much on the 
outcome side, where most attention has focused on the construction of NAEP, 
but rather on the input side. Indicators of student background and especially of 
key educational resources are currently quite limited in the TSA. For example, 
student socioeconomic status is indicated by parental education in our analy- 
ses. Indicators of parental occupation, income, eligibility for free lunch, and 
census-based indicators of neighborhood demographic condition, housing, etc., 
are absent. Regarding school-level organization, NAEP includes indicators of 
disciplinary climate, but no indicators of staff cohesion, control, and expecta- 
tions, or of academic press. Indicators of cognitive stimulation in the classroom 
are few and do not constitute a meaningful or reliable scale. Hence, we settled 
on a single indicator: emphasis on reasoning during math instruction. 

Given the limitations of NAEP indicators of student background, school 
organization, and instruction, our finding that NAEP indicators can account 
for nearly all the variation between states was a pleasant surprise. A more 
refined set of indicators would, however, provide more useful information to 
those who wish to use TSA, not just to “take the temperature” of the states, but 
to identify specific targets and strategies for interventions aimed at reducing 
inequality and thereby improving overall levels of student proficiency. 

In the following pages, I aim first to sketch briefly the longstanding de- 
bate over sources of educational inequality and its implications for 
accountability at the state level. Second, I describe the first phase of our inves- 
tigation: the modeling of student proficiency within states as a function of 
student background and educational resources. Third, I report results of our 
second investigation, which focuses on student access to key educational
resources in the participating states. A sub-theme in the description of each phase
involves challenges of analysis and measurement that also have important
implications for future summaries and uses of data from the TSA.

Home and School Differences As Sources of State 
Inequality in Mathematics Proficiency 

The debate about how to interpret the results from the TSA mirrors the 
longstanding debate about home and school sources of inequality in student 
outcomes. Social and ethnic inequality in achievement constitutes a trouble- 
some and enduring aspect of schooling in the U.S. Large achievement gaps 
between students of high and low socioeconomic status (SES) and between 
European American students, on the one hand, and African American and/or 
Hispanic students, on the other, have been verified in every major national 
study of secondary students, beginning with Coleman et al. (1966). Yet re- 
searchers have offered contrasting explanations for such inequality. 

Home Environmental Inequality 

From one standpoint, the school is an essentially neutral learning envi- 
ronment passively allowing sharp inequality in home circumstances to translate 
into similar inequalities in learning outcomes. Families have long been known 
to vary substantially in their capacities to provide educational 
environments that foster school readiness and reading literacy (Fraser 1959; 
Wolf 1968). Such differences are linked to social status indicators, including 
income, parental occupation, and parental education (Coleman et al. 1966; 
Peaker 1967). Parents of high social status are more likely than parents of low 
social status to have the resources and skills needed to support their children’s 
academic learning. 

If this explanation were completely sufficient to understand observed 
achievement gaps, variation in student achievement between schools would 
simply reflect the varied home environments of students attending those schools. 
Policy interventions aimed at increasing equity might focus primarily on early 
interventions such as Head Start and on providing support for the families of 
the most disadvantaged children. Interventions at the classroom or school lev- 
els, though perhaps laudable for increasing mean achievement, would hold 
less promise for reducing inequality.

School Environmental Inequality

From an entirely different standpoint, schools are a much more active 
force, subjecting essentially similar children to dramatically different learning 
experiences and thereby actively recreating in each new generation a wide 
intellectual inequality that conforms to the wide inequalities in earnings and 
occupational prestige. Clear expositions of this view appear in Ryan (1971), 
Bowles and Gintis (1976), and Kozol (1991). Tracking (Oakes 1985, 1990), 
differential teacher expectations (Rosenthal and Jacobson 1968; Rist 1970), 
varied school ethos or climate (Rutter et al. 1979), course requirements
(Lee and Bryk 1989), teacher subject matter and pedagogical knowledge (Finley 
1984; Rosenbaum 1976), and level of cognitive stimulation in the classroom 
(Page 1990; Rowan, Raudenbush, and Cheong 1993) are aspects of the school- 
ing system often viewed as fostering unequal opportunity and outcomes. 

If inequality of schooling were the sole determinant of inequality of edu- 
cational outcomes, inequality in school mean achievement would reflect school 
differences in policy and practice. Not surprisingly, those who have empha- 
sized the school as a causal agent in creating educational inequality, while 
often endorsing compensatory educational policies, have called for sweeping 
structural reforms in the provision of schooling. These include the elimination 
of tracking, school finance reform that would equalize spending across rich 
and poor districts (Berne 1994), and a recasting of teacher preparation to foster 
more favorable expectations and more cognitively stimulating instruction for 
currently disadvantaged students. If the “school effects” explanation were cor- 
rect, such reforms would reduce or eliminate differences between schools in 
achievement. 

The debate reviewed above leaves school differences in student mean 
outcomes open to vastly different interpretations. One observer might view an 
elevated school mean as simply reflecting an advantaged school composition; 
another would attribute this success to excellent school governance, organiza- 
tion, policy, and instructional practice. Those who study school effects seek to 
measure key aspects of both student composition and school process to assess 
the relative contributions of each and to isolate those contributors to achieve- 
ment that reformers can modify (Fuller 1987; Lee and Bryk 1993). Causal 
inference in such studies is always perilous because student composition and 
school process are inevitably correlated. Thus, if either student composition or 
school process is not measured well and is still included in the analysis, estimates 
of both will be biased. 

Given the difficulty of conducting sound studies of school effects, it is 
not surprising that schemes designed to hold schools accountable for their mean 
achievement levels have encountered intense criticism (Willms 1992). School 
means that are not adjusted for student composition will typically convey an 
overly negative picture of school process in those schools with the most disad- 
vantaged students. However, incorporating adjustments for composition 
typically leads to underestimates of the effectiveness of schools having favor- 
able student composition (Raudenbush and Willms 1995). 3 

Implications of the Debate for Interpreting State Variation in 
Outcomes 

All of the difficulties in interpreting school differences in mean outcomes 
are amplified when interest focuses on state mean differences. First, state means 
are simply aggregates of school means — the same means that have been found 
difficult to interpret in all but the most careful studies. Second, while all of the 
problems associated with interpreting either unadjusted or adjusted school 
means are present in adjusting state means, others are added. For example, the 
association between student composition and school processes will vary from 
state to state, as we show below, making the problem of finding meaningful 
adjustments for student composition even more perplexing. And differences in 
state means will at least partially reflect differences in state policy. Such policy 
differences may also be correlated with school composition and school pro- 
cess, creating extra uncertainty about the sources of state variation. 

Thus, while making good estimates of state mean proficiency appears 
essential to any picture of the condition of the nation’s education system, state 
differences in mean proficiency are, by themselves, intrinsically ambiguous at 
best and misleading at worst because of the inevitable temptation to make 
groundless causal inferences. 



3 Student advantage is typically positively correlated with effective school process. 
Analyses that control student demographics without incorporating good measures of 
school process will over-estimate the importance of student background, thus leading to 
overly severe adjustments for student background and thereby underestimating the 
effectiveness of schools serving advantaged students. Rarely do school accountability 
studies measure key aspects of school process. 

The problem of interpreting state means can perhaps be clarified with 
reference to a simple causal model (figure 1). Those who interpret state means 
from TSA are typically interested in the role of state government in improving 
student achievement (arrow F of figure 1). However, in principle, states cannot 
directly alter student learning (which is why arrow F is a “dashed line” rather 
than a solid line). Instead, state policy may affect student achievement indi- 
rectly by encouraging favorable practice and resources at the level of the school 
or teacher (arrow D). Schools and teachers can directly affect student achieve- 
ment (arrow A), though any analysis of such effects must account for student 
background (arrow B) because school and teacher practice are likely corre- 
lated with student background (arrow C). 

The first phase of our analysis uses NAEP data to study arrows A and B, 
i.e., to assess the contribution of school and teacher quality and the contribution 
of student background in each of 41 states. The second phase considers arrow D, 
the differences between states with respect to those school and teacher resources 
and practices found consistently correlated with student achievement. 

Figure 1. Conceptual Model for State-level 
Policy Effect on Student Achievement 



Phase I: Correlates of Proficiency within States 

The first phase of our analysis was to study home and school correlates 
of eighth grade mathematics proficiency within each state. Our hypotheses 
were that student social, ethnic, and linguistic background, along with indica- 
tors of the home literacy environment, would be related to mathematics 
proficiency, as in past research; and that indicators of key aspects of school 
quality, such as course-taking opportunities, disciplinary climate, teacher quali- 
fications, and cognitive stimulation in the classroom, would also predict 
proficiency. It was essential in this analysis that effects of student background 
and school quality indicators be adjusted for each other and for other contex- 
tual variables such as the composition of the school. This exercise could be 
viewed as much as a validation study of TSA indicators as a test of theory. We 
wanted to see whether TSA indicators of home background and school quality 
were sufficiently well measured to reproduce essential findings of past research. 
We also sought to examine the power of our within-state models to account for 
variation between states. 



Our expectation was that key variables measured at the student and school 
level would account for most of the variation between states. This expectation 
was driven by substantive, rather than statistical, concerns. Controlling for ex- 
planatory variables at lower levels of aggregation, such as the student or the 
school, need not reduce variation at a higher level, such as the state. The ad- 
justed between-state variation can, in principle, be either smaller or larger than 
the unadjusted between-state variation. However, it stands to reason that states 
will vary in outcomes for two reasons: selection processes and effects of state 
educational policy and practice. Selection processes arise because patterns of 
settlement, fertility, and economic dislocation produce state variation in the 
demographic and cultural backgrounds of students and their families. Educa- 
tional policies and practices of schools vary because of the uniquely 
decentralized character of the U.S. education system and because states and 
localities tailor the provision of education to the populations they serve. How- 
ever, states are limited in the “levers” available to them to affect student 
outcomes. These levers include regulations, incentives, and forms of aid that 
can have only indirect effects on students by affecting district and school 
leadership and, ultimately, instruction. It follows that if key aspects of selection, 
school practice, and instruction are controlled, no state variation will remain to 
be explained. In terms of figure 1, once arrows A and B are controlled, arrow F 
should be nonsignificant. This makes sense theoretically but may be difficult 
to show empirically with NAEP data because NAEP indicators of school re- 
sources and home background are limited. 

Sample and Measures 

Sample 

The analyses are based on data from 99,980 eighth graders attending 
3,537 schools located in the 41 states and territories participating in the 1992 
Trial State Assessment in mathematics. Thus, the average state sample included 
2,377 students and 86 schools. 

Students within each state were selected by means of a two-stage cluster 
sample with stratification at the first stage. Specifically, schools were first strati- 
fied on the basis of urbanicity, minority concentration, size, and area income; 
then (a) schools were selected at random within strata with a probability pro- 
portional to student grade level enrollment; and (b) students were systematically 
selected from a list of students, given a random starting point, within schools. 
It is essential that the analysis plan take into account the stratified and clus- 
tered nature of the sample. 
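
The two-stage design described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `strata` data layout, and the use of systematic selection to achieve probability proportional to enrollment are hypothetical choices, not the documented NAEP procedure.

```python
import random

def pps_systematic(sizes, n, rng):
    """Pick n indices with probability proportional to size (systematic PPS).
    A unit larger than the sampling interval can be picked more than once."""
    step = sum(sizes) / n
    start = rng.random() * step              # random start in [0, step)
    picks, cum, idx = [], 0, 0
    for i in range(n):
        target = start + i * step
        # advance until the cumulative-size interval contains the target
        while idx < len(sizes) - 1 and cum + sizes[idx] <= target:
            cum += sizes[idx]
            idx += 1
        picks.append(idx)
    return picks

def systematic(items, n, rng):
    """Pick n items at equal intervals from a list, given a random starting point."""
    step = len(items) / n
    start = rng.random() * step
    return [items[min(int(start + i * step), len(items) - 1)] for i in range(n)]

def two_stage_sample(strata, schools_per_stratum, students_per_school, seed=0):
    """strata: {stratum name: [(school id, [student ids]), ...]} (hypothetical layout)."""
    rng = random.Random(seed)
    sample = []
    for name, schools in strata.items():
        sizes = [len(students) for _, students in schools]   # grade enrollment
        for idx in pps_systematic(sizes, schools_per_stratum, rng):
            school_id, students = schools[idx]
            n = min(students_per_school, len(students))
            sample.extend((name, school_id, s) for s in systematic(students, n, rng))
    return sample
```

Because larger schools are more likely to be drawn at the first stage, students end up with unequal selection probabilities overall, which is one reason the analysis applies design weights.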

Measures 

Table 1 lists the variables used and their descriptive statistics. The vari- 
ables include student outcome data, demographic indicators, home 
environmental indicators, and classroom and school characteristics. 

Measures of math proficiency. The math proficiency data collected as part of NAEP 
involve a matrix-sampling scheme in which each student was observed on only 
a subset of relevant items. Rather than yielding a single measured variable, 
NAEP produces five “plausible values” — random draws from the estimated 
posterior distribution of each student’s “true” outcome given the subset of items 
and other data observed on that student (Johnson, Mazzeo, and Kline 1993). 

Measures of student demographics. Student demographic variables consist of 
gender (indicator for male), ethnicity (indicators for Hispanic American, 
non-Hispanic black American, Asian American, and Native American, with 
European American as the reference group), national origin (indicator for born 
outside the U.S.), family type (indicators for living at home with a single 
parent, living at home with both parents, with other type as the reference group), 
and parental education (indicators for high school graduate, some education 
after high school, and college graduate, with not graduated from high school 
or the eighth grader not knowing parents’ educational level as the reference 
group). 

Table 1. Descriptive Statistics for Student- and School-level Variables 
for the Combined Sample 

Variables                                             Code and range        Mean    Standard deviation

Student-level data (99,980 students) 

Outcome variables 
  Math proficiency 1                                  (-2.96, 3.06)          0.03     0.99 
  Math proficiency 2                                  (-3.82, 2.71)          0.03     0.99 
  Math proficiency 3                                  (-3.75, 3.33)          0.03     0.99 
  Math proficiency 4                                  (-3.22, 2.87)          0.03     0.99 
  Math proficiency 5                                  (-3.84, 2.76)          0.03     0.99 

Demographics 
  Male                                                0 = No, 1 = Yes        0.50     0.51 
  African American                                    0 = No, 1 = Yes        0.15     0.36 
  Hispanic American                                   0 = No, 1 = Yes        0.14     0.35 
  Asian American                                      0 = No, 1 = Yes        0.03     0.19 
  Native American                                     0 = No, 1 = Yes        0.02     0.12 
  Not born in U.S.                                    0 = No, 1 = Yes        0.07     0.26 

Home environment 
  Living with both parents                            0 = No, 1 = Yes        0.70     0.47 
  Living with one parent                              0 = No, 1 = Yes        0.20     0.41 
  Parental education: high school diploma             0 = No, 1 = Yes        0.30     0.47 
  Parental education: more than high school diploma   0 = No, 1 = Yes        0.18     0.40 
  Parental education: bachelor’s degree or more       0 = No, 1 = Yes        0.26     0.45 
  Hours watching TV                                   (0, 6)                 3.17     1.61 
  Changed school in past 2 years                      0 = No, 1 = Yes        0.22     0.42 
  Get newspaper regularly                             0 = No, 1 = Yes        0.73     0.46 
  More than 25 books in home                          0 = No, 1 = Yes        0.91     0.29 
  Get magazines regularly                             0 = No, 1 = Yes        0.76     0.44 

Classroom characteristics 
  Taking algebra                                      0 = No, 1 = Yes        0.19     0.40 
  Taking pre-algebra                                  0 = No, 1 = Yes        0.25     0.44 
  Teaching experience of math teacher                 (1, 30)               13.44     8.85 
  Math teacher majored in math                        0 = No, 1 = Yes        0.43     0.51 
  Math teacher majored in math education              0 = No, 1 = Yes        0.18     0.39 
  Math teacher did graduate work                      0 = No, 1 = Yes        0.47     0.51 
  Math teacher emphasized reasoning/                  0 = otherwise,         0.46     0.51 
    analysis in class                                 1 = heavy/moderate 

School-level data (3,537 schools) 

School-level variables 
  Median income (in thousands)                        (9.073, 85.567)       28.80    10.73 
  Instructional dollars per pupil                     (7.5, 17.5)           67.22    30.23 
  Percent minority                                    (1, 100)              28.02    27.70 
  Urban location                                      0 = No, 1 = Yes        0.23     0.42 
  Rural location                                      0 = No, 1 = Yes        0.23     0.42 
  Offering 8th grade algebra for high                 0 = No, 1 = Yes        0.75     0.43 
    school credits 
  Availability of computer                            0 = No, 1 = Yes        0.83     0.37 
  School climate                                      (-3.003, 1.191)        0.00     0.63 

Table 1 presents the descriptive statistics on student demographics for 
the combined 41 states. As table 1 shows, half of the 99,980 students were 
male. African Americans made up 15 percent of the sample; Hispanic Americans, 
14 percent; Asian Americans, 3 percent; and Native Americans, 2 percent; and 7 
percent of the students were not born in the U.S. In addition, 70 percent of the 
students indicated that they had two parents residing at home, and 20 percent 
of students reported that they lived in a single-parent household. For 30 percent 
of the sample, either the mother or the father held a high school diploma; for 
18 percent, one parent had some education after high school graduation; and 
for 26 percent, at least one parent graduated from college. 

Measures of home environment. Home environment variables include amount of 
time watching television, mobility (as indexed by whether a student changed 
schools in the past two years), and home literacy environment (indicators for 
receiving a newspaper, having more than 25 books, and subscribing to 
magazines). Table 1 indicates that the students spent 3.17 hours daily on average 
watching TV. Less than a quarter of them (22 percent) reported that they 
had changed schools in the past two years. About three-fourths of the students 
(73 percent and 76 percent) indicated that their households regularly got 
newspapers and magazines, respectively. The great majority of the students, 91 percent, 
had more than 25 books in their homes. 

Measures of classroom characteristics. Classroom characteristics involve type of 
course (indicators for pre-algebra, algebra, with other course as the reference 
group), the teaching experience and qualifications of the teacher of the student 
(indicators for undergraduate math major in college, math education major in 
college, with other major as the reference group; and an indicator for having a 
graduate degree), as well as teacher-reported emphasis on reasoning in the 
classroom (an indicator for moderate to high emphasis). The data on teacher 
background and pedagogical practice were taken from responses to question- 
naires administered to the mathematics teachers of the students sampled. 

Table 1 shows that 19 percent of the students in the sample were enrolled in an 
algebra course and 25 percent of them took pre-algebra. The average number 
of years of teaching experience for the teachers of the students sampled was 
about 13. Furthermore, 43 percent of the students had a teacher who majored 
in mathematics as an undergraduate; 18 percent of the students had a teacher 
who was a math education major; and 47 percent of the students had a teacher 
who had a graduate degree. About half of the students (47 percent) were in a 
classroom where reasoning received a moderate to high level of emphasis. 

Measures of school characteristics. School characteristics include the social and 
racial composition of a school as measured by median income and percent 
minority (Hispanic and African American students). Other school-level measures 
are location (indicators for an urban school and a rural school, with suburban 
school as the reference group), financial and computing resources as indexed 
by instructional dollars per pupil and availability of computers (an 
indicator for the availability of computers in a math classroom or a lab for 
most of the time), course offerings (an indicator for the availability of algebra 
for high school credit), and a scale measuring the disciplinary climate of the 
school. The scale was created from the following items indicating the extent to 
which each was a problem in the school: tardiness, absenteeism, cutting classes, 
physical conflicts, drug and alcohol use, health, teacher absenteeism, racial or 
cultural conflict. Each item was first standardized, and the scale was constructed 
as the average of the nine standardized scores. Average Cronbach’s alpha for 
the 41 states was .79. 
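
As a rough illustration of how such a scale and its reliability can be computed, the sketch below standardizes each item, averages the standardized scores, and evaluates Cronbach's alpha with the usual formula k/(k-1) * (1 - sum of item variances / variance of the total score). The function names and the column-wise data layout are assumptions for illustration; the source does not say whether alpha was computed on raw or standardized items, or how items were coded.

```python
from statistics import mean, pstdev, pvariance

def standardize(column):
    """Convert one item's ratings to z-scores (assumes the item is not constant)."""
    m, s = mean(column), pstdev(column)
    return [(x - m) / s for x in column]

def scale_scores(items):
    """items: list of columns, one per questionnaire item, each a list of
    school ratings. Returns one scale score per school: the average of
    that school's standardized item scores."""
    z = [standardize(col) for col in items]
    return [mean(row) for row in zip(*z)]

def cronbach_alpha(items):
    """Cronbach's alpha for k items measured on the same schools."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]
    item_variance = sum(pvariance(col) for col in items)
    return k / (k - 1) * (1 - item_variance / pvariance(totals))
```

With nine items per school, `scale_scores` returns a scale centered at zero within the set of schools supplied, and `cronbach_alpha` approaches 1 as the items move together more closely.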

Analytic Approach 

Math Proficiency 

Our strategy for modeling math proficiency has two stages: a within- 
state analysis and a between-state analysis. The within-state analysis uses a 
hierarchical linear model to handle the clustered character of the sample. Sample 
design weights are applied at the student level to accommodate the stratified 
character of the sample and the associated over-sampling of certain subgroups. 
This analysis is replicated for each plausible value and the results pooled as 
recommended in Little and Schenker (1994) and Mislevy (1992), using a spe- 
cialized version of the HLM program (Bryk, Raudenbush, and Congdon 1994) 
originally adapted for multiple plausible values by Arnold, Kaufman, and 
Sedlacek (1992). The output for each state is a vector of parameter estimates 
and their estimated sampling variance matrix. These then provide input data 
for the second stage of the analysis, which involves an empirical Bayes and a 
fully Bayesian synthesis of findings across states. The syntheses employ the method 
of moments (Raudenbush 1994) and Gibbs sampling (Gelfand and Smith 
1990). (See Raudenbush, Fotiu, and Cheong [1998] for a full exposition of the 
approach.) Taken together, the two stages have the structure of a planned “meta- 
analysis” (Glass 1976) in which each state’s separate analysis constitutes a 
“study,” and the between-state analysis combines these results. 
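
The pooling step across plausible values can be illustrated with the standard multiple-imputation combining rules: the pooled estimate is the average of the per-value estimates, and its variance is the average sampling variance plus (1 + 1/M) times the variance between the M estimates. The sketch below is a minimal version of those rules with hypothetical function and argument names; it does not reproduce the specialized HLM program mentioned above.

```python
from statistics import mean, variance

def pool_plausible_values(estimates, sampling_variances):
    """Combine one coefficient estimated separately on each plausible value.

    estimates          -- coefficient estimate from each of the M analyses
    sampling_variances -- squared standard error from each analysis
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(sampling_variances)      # average sampling variance
    between = variance(estimates)          # variance across plausible values (divisor M - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total
```

The between-value term is what carries the uncertainty due to each student having been observed on only a subset of items.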

Within-state Models 

To address these questions, we first formulated within each state two 
separate two-level hierarchical models, one with and one without covariates 
(measures on student demographics, home environment, and classroom and 
school characteristics). Past research on the associations between the social 
distribution of educational resources and outcomes guided the specification of 
the former model (e.g., Bernstein 1970; Bryk and Thum 1989; Coleman 
et al. 1966; Finley 1984; Oakes 1985; Page 1990; Raudenbush, Rowan, and 
Cheong 1993; Rosenbaum 1976; and Rutter et al. 1979). The model is 

1.  $Y_{ijk} = \beta_{0k} + \sum_{p=1}^{P} \beta_{pk} X_{pijk} + u_{jk} + e_{ijk},$ 

where 

$Y_{ijk}$ is the math proficiency score for student $i$ in school $j$ and state $k$; 
$\beta_{0k}$ is the mean for state $k$, which is adjusted for the school- and student- 
level covariates; 
$X_{pijk}$ is the $p$th covariate, which is centered around the Michigan mean; 
$\beta_{pk}$ is the regression coefficient associated with each $X_{pijk}$; 
$u_{jk}$ and $e_{ijk}$ are the residual random school and student effects. They are 
assumed independently and normally distributed with variances $\omega^2$ and $\sigma^2$, 
respectively. 

Estimates of the two variance components, $\omega^2$ and $\sigma^2$, incorporate variation 
associated with the cluster sample so that the maximum likelihood (ML) 
estimate of each regression coefficient and its standard error incorporates the 
extra variation arising from the clustered nature of the sample. The use of sampling 
weights accounts for unequal probability of selection, and multiple 
plausible value analysis accounts for the estimation of proficiency. 

Deviating the school- and student-level covariates around the Michigan 
means allows us to obtain more precise estimates of various parameters for our 
own state, Michigan. 4 For the sake of simplicity, we forgo the option of 
allowing any of the partial effects associated with student-level covariates to 
vary randomly from school to school within state $k$. Thus, only $\beta_{0k}$, the 
intercept, varies randomly across schools within states. 

4 The covariates can be deviated around other constants such as the national means for 
other purposes. 

Between-state Models 

The between-state synthesis combined the output produced by each state 
to obtain inferences on parameters for individual states as well as global 
parameters. The output from the within-state analysis for state $k$ consisted of the 
ML estimates $b_k$ of the state mean and its estimated sampling variance $v_k$. The 
estimate $b_k$ is assumed to vary around its corresponding parameter $\beta_k$ with a 
unique error $r_k$ associated with the sample for state $k$, which has a known 
sampling variance $v_k$, i.e., 

2.  $b_k = \beta_k + r_k, \quad r_k \sim N(0, v_k).$ 

The parameter $\beta_k$ is in turn assumed to vary around an overall mean $\gamma$ plus a 
random error associated with state $k$, $\mu_k$. We may write 

3.  $\beta_k = \gamma + \mu_k.$ 

The random error $\mu_k$ has a variance of $\tau$. 
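
To make the between-state model concrete, the sketch below applies a simple method-of-moments estimator of the between-state variance (the DerSimonian-Laird form that is common in meta-analysis) to a set of state estimates and their sampling variances, and then shrinks each state's estimate toward the overall mean. This is an illustration under assumptions: the function name is hypothetical, and the estimator may differ in detail from the one the authors used.

```python
def empirical_bayes(b, v):
    """b: state estimates; v: their known sampling variances.
    Returns (overall mean, between-state variance, shrunken state estimates)."""
    k = len(b)
    w = [1 / vi for vi in v]                                   # precision weights
    fixed = sum(wi * bi for wi, bi in zip(w, b)) / sum(w)      # mean assuming no between-state variance
    q = sum(wi * (bi - fixed) ** 2 for wi, bi in zip(w, b))    # heterogeneity statistic
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau = max(0.0, (q - (k - 1)) / c)                          # method-of-moments estimate, truncated at 0
    w_star = [1 / (vi + tau) for vi in v]
    gamma = sum(wi * bi for wi, bi in zip(w_star, b)) / sum(w_star)
    # each state's estimate is pulled toward gamma in proportion to its sampling variance
    shrunk = [(tau * bi + vi * gamma) / (tau + vi) for bi, vi in zip(b, v)]
    return gamma, tau, shrunk
```

States with noisy estimates (large sampling variance) are pulled strongly toward the overall mean, while precisely estimated states keep values close to their own estimates. When the estimated between-state variance is 0, every state receives the overall mean.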

Table 2 lists the approximate posterior means and standard deviations of 
the various regression coefficients, and the estimates of between-state vari- 
ance and their square roots. 5 We computed z-ratios for the regression coefficients 
to evaluate the null hypothesis that a particular regression coefficient pooled 
across states was 0. A z-ratio larger than 2 or 3, as indicated by asterisks in 
table 2, lent support to rejection of the null hypothesis. 
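
The z-ratio is simply the approximate posterior mean divided by its approximate posterior standard deviation. The short sketch below recomputes it for three rows of table 2 and applies the cutoff of 3 given in the table note; the function name and the dictionary layout are illustrative assumptions.

```python
def z_ratio(posterior_mean, posterior_sd):
    """Ratio of a pooled coefficient to its posterior standard deviation."""
    return posterior_mean / posterior_sd

# (posterior mean, posterior standard deviation) pairs copied from table 2
rows = {
    "Male": (0.0904, 0.0061),
    "Math teacher did graduate work": (0.0101, 0.0084),
    "Urban location": (0.0140, 0.0143),
}

# True where the absolute z-ratio exceeds 3
flagged = {name: abs(z_ratio(m, s)) > 3 for name, (m, s) in rows.items()}
```

Rows whose standard deviation is printed as 0.0000 because of rounding cannot be checked this way from the published table.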

Student demographics. Controlling for home environments and for classroom 
and school characteristics, the results suggest that, on average, males had higher 
scores than females; and African Americans, Hispanic Americans, and Native 
Americans exhibited lower proficiency than did European or Asian Americans. 
For instance, African Americans obtained, on average, about half a 
standard deviation lower math proficiency than did European Americans. Net 
of other covariates, students who were born in the United States scored higher 
than those who were not. The partial effects associated with the African American 
and Hispanic American ethnicity and the place of birth variables seem to 
vary from state to state. 

Home environment. Controlling all other covariates, family structure, parental 
education, and home literacy environment were related to proficiency. Students 
who lived with either one parent or both parents outperformed those who 
did not and also those who did not know the educational levels of their parents. 
Students whose parents had education beyond high school and those whose 
parents had college degrees scored higher than did those whose parents had 
not graduated from high school. Furthermore, students coming from households 
that had more than 25 books in the home and received newspapers and 
magazines regularly had higher math proficiency than those who came from 
households that did not. There were statistically significant negative partial 
effects associated with time spent watching TV and changing schools in the 
past two years. Three of the between-state variance estimates were 0. 

Table 2. Empirical Bayes Summary of State-by-State Results 

                                                   Approximate     Approximate      Estimate of      Square root of
                                                   posterior       posterior        between-state    the estimate of
                                                   mean of         standard         variance,        between-state
Predictors                                         $\gamma_p$      deviation of     $\tau_p$         variance,
                                                                   $\gamma_p$                        $\tau_p^{1/2}$

Demographics 
  Male                                              0.0904*         0.0061           0.0005           0.0231 
  African American                                 -0.4583*         0.0215           0.0123           0.1107 
  Hispanic American                                -0.3894*         0.0271           0.0239           0.1546 
  Asian American                                    0.1288*         0.0216           0.0032           0.0565 
  Native American                                  -0.2162*         0.0223           0.0043           0.0654 
  Not born in U.S.                                 -0.2369*         0.0211           0.0110           0.1046 

Home environment 
  Living with both parents                          0.2884*         0.0116           0.0021           0.0456 
  Living with one parent                            0.2500*         0.0124           0.0022           0.0471 
  Parental education: high school diploma           0.0567*         0.0082           0.0008           0.0273 
  Parental education: more than high school diploma 0.2455*         0.0085           0.0217           0.2455 
  Parental education: college degree                0.2146*         0.0125           0.0041           0.0638 
  Hours watching TV                                -0.0404*         0.0024           0.0001           0.0112 
  Changed school in past 2 years                   -0.0640*         0.0063           0.0000           0.0000 
  Get newspaper regularly                           0.0277*         0.0057           0.0000           0.0000 
  More than 25 books in home                        0.2051*         0.0092           0.0000           0.0000 
  Get magazines regularly                           0.1006*         0.0075           0.0007           0.0259 

Classroom characteristics 
  Taking algebra                                    0.9830*         0.0201           0.0141           0.1188 
  Taking pre-algebra                                0.3972*         0.0159           0.0083           0.0912 
  Teaching experience of math teacher               0.0029*         0.0006           0.0000           0.0000 
  Math teacher majored in math                      0.0844*         0.0121           0.0038           0.0844 
  Math teacher majored in math education            0.0823*         0.0149           0.0055           0.0738 
  Math teacher did graduate work                    0.0101          0.0084           0.0010           0.0320 
  Math teacher emphasized reasoning/                0.1373*         0.0096           0.0023           0.0478 
    analysis in class 

School characteristics 
  Median income                                     0.0059*         0.0007           0.0000           0.0000 
  Instructional dollars per pupil                   0.0000          0.0000           0.0000           0.0000 
  Percent minority                                 -0.0036*         0.0000           0.0000           0.0000 
  Urban location                                    0.0140          0.0143           0.0014           0.0380 
  Rural location                                   -0.0191          0.0225           0.0125           0.1120 
  Offering 8th grade algebra for                   -0.0425*         0.0138           0.0018           0.0428 
    high school credits 
  Availability of computer                          0.0024          0.0124           0.0000           0.0000 
  School climate                                    0.0378*         0.0079           0.0000           0.0000 

Intercept 
  Intercept                                         0.0680          0.0096           0.0000           0.0000 

* z-score > 3. 

5 Table 2 gives the empirical Bayes summary results. Raudenbush et al. (in press) provide 
results from the fully Bayesian synthesis and compare the two sets of results. Individual 
state results are available upon request. 



Classroom characteristics. Enrollment in algebra and pre-algebra was positively 
related to math scores, all else being equal. Those who took algebra scored 
about one standard deviation higher than the reference group, whose students 
took eighth grade math or another non-algebra course or who did not take any 
math course. Those who enrolled in pre-algebra scored about 0.4 standard 
deviation higher than the reference group. Teaching experience, teacher subject 
matter expertise (as indicated, respectively, by majoring in math or math 
education), and emphasis on reasoning 6 were also positively correlated with 
proficiency in math, net of the effects of other covariates. 

School characteristics. School composition effects were manifest, net of all other 
predictors, including student demographic background. In particular, school 
median income was positively related to proficiency, and percent minority was 
negatively related to proficiency. Thus, school social class and ethnic segrega- 
tion effects tend to reinforce differences based on individual social class and 
ethnicity. All else being equal, a favorable school climate was positively re- 
lated to proficiency. The estimated partial effect of school algebra was 
statistically significant and negative. Note that this effect represented the ex- 
pected difference in math proficiency between a student not taking algebra in a 
school that offered algebra and a student in a school that did not offer algebra. 
One implication of the predominantly negative effect across the states is that 
there are at least some students in schools not offering algebra who would have 
benefited from enrollment in an algebra course had they attended schools that 
did offer algebra. In addition, as taking algebra was, in general, the most pow- 
erful single predictor of proficiency, one must conclude that attending a school 
that offers algebra is related positively to math proficiency. 

In sum, the relevant covariates include indicators of student demographic 
status, home environment, and school composition; these relate to proficiency 
as expected. At the school level, a curriculum that includes opportunities to 
take high school algebra and a positive climate were linked to proficiency. At 
the classroom level, teachers’ subject-matter preparation, as indicated by having 
majored in math or in math education, and emphasis on mathematical 
reasoning predicted elevated proficiency. 

6 One would expect the level of reasoning to increase with teacher’s education (e.g., a 
teacher’s undergraduate major) and the difficulty of the course (e.g., an algebra course 
versus a general mathematics course). Emphasis on reasoning, teacher’s education, and 
course type thus may jointly influence math proficiency. To understand how these various 
predictors may be correlated with the math scores, two models were specified, one with 
and one without emphasis on reasoning entered as a predictor. The results showed that 
reasoning, independent of all other covariates, was positively related to math proficiency. 
In fact, the estimates of other predictors remained nearly the same in the two models. 

Variance Reduction 

Figures 2 and 3 give the approximate marginal posterior for the variance 
for the intercept, that is, for $\mathrm{var}(\beta_{0k}) = \tau$, for the unconditional (with no 
covariates) and conditional models. 7 Figure 2 shows unmistakable evidence of 
heterogeneity between states (note that 0 is not a plausible value for $\tau$). However, 
there is considerable uncertainty about the magnitude of this heterogeneity. 

The math proficiency measure was on a scale with a mean near 0 and a 
variance of approximately unity. The posterior mean of $\tau$ is .088, implying that 
about 8.8 percent of the variance in the outcome lies between states. However, 
$\tau$ values as small as .04 and as large as .14 are not improbable. Thus, it appears 
that from 4 percent to 14 percent of the variance in the outcome lies between 
states. 

Whereas figure 3 shows evidence of heterogeneity between states (note 
that 0 is not a plausible value for $\tau$) after controlling for the various measures, 
there is every reason to believe that the magnitude of this heterogeneity is 
small. The posterior mean of $\tau$ is .018, implying that 1.8 percent of the 
variance in the outcome lies between the intercepts of the states. Moreover, the 
unknown value of $\tau$ is unlikely to exceed .03 or 3 percent of the total variance 
in the outcome. It appears that from .004 percent to 3 percent of the variance in 
the intercept lies between states after controlling for covariates. Thus, most of 
the state-to-state heterogeneity is explainable on the basis of covariates de- 
fined on students, teachers, and schools. This indicates, in general, that states 
with high mean proficiency tend to be advantaged on the relevant covariates 
and that these advantages account for most state-to-state variation in profi- 
ciency. 
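
One way to summarize the comparison is the proportional reduction in between-state variance when the covariates are added. The sketch below uses the two posterior means quoted above (.088 and .018) as point summaries; the function name is an illustrative assumption, and the resulting ratio is a derived figure rather than one reported in the text.

```python
def proportion_explained(tau_unconditional, tau_conditional):
    """Share of the unconditional between-state variance that disappears
    once the covariates are controlled."""
    return (tau_unconditional - tau_conditional) / tau_unconditional

# posterior means of the between-state variance from the two models
reduction = proportion_explained(0.088, 0.018)   # roughly 0.80
```

Because both inputs are posterior means of quantities with considerable uncertainty, the ratio should be read as a rough indication of magnitude only.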

Phase II: Inequality of Access to Educational Opportunity 

In terms of figure 1, our “first phase” analysis found certain school 
resources (arrow A) and student background indicators (arrow B) to be quite 
consistently related to student achievement. Controlling for these, state 
differences in achievement (arrow F) became small, perhaps negligible. This 
encouraged us to abandon further investigation of state means, whether 
adjusted or unadjusted. Rather, we sought in Phase II of our investigation to 
examine state differences in school resources. Given the consistent association 
between advantaged home background and achievement, we were especially 
interested in the equity with which the school resources are distributed. We 
asked: “Does the distribution of school resources likely reinforce or counteract 
inequalities arising from home environment? Do states differ, not only in the 
provision of resources, but also in the equity with which they are distributed?” 

7 The figures are output obtained from the Bayesian synthesis (see Raudenbush, Fotiu, and 
Cheong [in press] for a description of the approach). 

Figure 2. Estimated Posterior Distribution of $\tau$: Unconditional Model 

One product of this work is a different kind of “report card” for states 
than is typically made available to policymakers. The typical report card provides 
unadjusted differences between states in academic proficiency. This typical 
report card, though conveying some useful information, can easily mislead. It 
tends to provide an overly negative portrayal of education systems in states 
with comparatively disadvantaged demographics and an overly rosy picture of 
education in states with more advantaged students. Moreover, it provides little 
insight into ways in which policy changes might produce better outcomes. 

Figure 3. Estimated Posterior Distribution of τ (Intercept): Conditional Model 

The report card we present compares states on educational opportunities, 
resources, or processes theoretically and empirically linked to outcomes. It 
reveals the equity with which these are distributed as a function of student 
social background and ethnicity. It therefore points the discussion toward in- 
terventions that would increase the quality and equity of education provision. 

In modeling the relationship between student demographic background 
and educational resources, our analysis strategy depended on whether the educational 
resource in question was measured dichotomously or continuously. 
Dichotomous resources included school course offering (1 = school offers high 
school algebra, 0 = school does not offer high school algebra), teacher education 
(1 = teacher majored in math, 0 = teacher did not major in math), and 
emphasis on reasoning in the classroom (1 = high, 0 = other). 

Model for the Continuous Outcome (Disciplinary Climate) 

The method of estimation for the model studying school climate involves 
a two-level hierarchical linear model (Bryk and Raudenbush 1992) with stu- 
dents nested within states. Robust standard errors were computed using the 
generalized estimating equation approach of Zeger, Liang, and Albert (1988). 
These standard errors are relatively insensitive to mis-specification of the vari- 
ances and covariances at the two levels and to the distributional assumptions at 
each level. State-specific effects were estimated via empirical Bayes (Morris 
1983; Raudenbush 1988). 

Specifically, we estimated a within-state model in which ethnicity, pa- 
rental education, and the ethnicity-by-parent interaction predicted school 
climate. Ethnicity was represented by four dummy variables and parental edu- 
cation by two dummy variables. Allowing for the ethnicity-by-parent interaction 
effect enabled us to model access to resources for each sub-group (e.g., Afri- 
can Americans of low, middle, or high parental education). We allowed 
coefficients for the parental education dummies and for African American and 
Hispanic American ethnicity to vary randomly over states, thus allowing state- 
by-state comparisons. Sample sizes of Asian Americans and Native Americans 
were, unfortunately, too small to allow such a fine-grained analysis. 
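
The dummy coding described above can be sketched as follows. This is a minimal illustration, not the authors' code: the choice of reference categories (European American, low parental education) and the level labels are assumptions made for the example, and only the fixed-effects design row is shown, without the random state-level coefficients.

```python
from itertools import product

# Assumption: the first entry of each list is the omitted reference category.
ETHNICITY_LEVELS = ["European American", "African American", "Hispanic American",
                    "Asian American", "Native American"]
PARENT_ED_LEVELS = ["low", "middle", "high"]


def dummies(value, levels):
    """Indicator coding that omits the first level as the reference category."""
    return [1 if value == level else 0 for level in levels[1:]]


def design_row(ethnicity, parent_ed):
    """Intercept, 4 ethnicity dummies, 2 education dummies, 8 interaction terms."""
    eth = dummies(ethnicity, ETHNICITY_LEVELS)
    ed = dummies(parent_ed, PARENT_ED_LEVELS)
    interaction = [e * p for e, p in product(eth, ed)]
    return [1] + eth + ed + interaction


# One design row per ethnicity-by-education subgroup.
rows = {cell: design_row(*cell)
        for cell in product(ETHNICITY_LEVELS, PARENT_ED_LEVELS)}

for cell, row in rows.items():
    print(cell, row)
```

With the interaction included, the 15 columns are enough to give each of the 15 subgroups its own predicted value, which is what allows access to be modeled separately for, say, African Americans of low, middle, or high parental education.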

Models for the Dichotomous Resource Indicators 

The same explanatory model used for school climate was specified for 
each dichotomous outcome. In this case, however, we used a two-level logistic 
regression model, estimated by penalized quasi-likelihood (Breslow and Clayton 
1993), with robust standard errors. Such a model is equivalent to a 2 by 3 by 5 
by 41 contingency table with 2 levels of the outcome, 3 levels of parent education, 
5 levels of ethnicity, and 41 levels representing states. 

Results 

We now consider the degree of ethnic and social equality in access to the 
four resources of interest. Specifically, we ask the following questions for each 
resource indicator: 

1. Averaging within the 41 states participating in the TSA, to what extent 
does student social background, as indicated by parental education and 
student ethnicity, predict access to the resources? 

2. Does the degree of inequality in access vary by state? If so, how do the 
41 states compare? 

Results Averaged Across States 

School Disciplinary Climate 

Figure 4 gives the graph of the fitted model in which ethnicity and paren- 
tal education predict access to favorable disciplinary climate. The figure shows 
that higher levels of parental education are clearly linked to more favorable 
disciplinary climate. The near parallelism of the five lines (with the exception 
of the line for Native Americans, which is based on a comparatively small 
sample) reflects the absence of any statistical evidence of a two-way interaction 
involving parental education and ethnicity. There is a substantial, statistically significant 
vertical displacement between ethnic groups. Pairwise comparisons using a 
Bonferroni adjustment to control the family-wise Type I error rate at the 5 
percent level indicated four separate clusters of means (in descending order of 
magnitude): (a) European Americans; (b) Asian Americans and Native Ameri- 
cans; (c) Hispanic Americans; and (d) African Americans. Given that the school 
climate outcome had a mean of 0 and a standard deviation of 0.63, the differ- 
ences manifest in figure 4 are non-trivial in magnitude: About 0.20 standard 
deviation units separate those with parents having a BA from those whose 
parents were without a high school diploma; nearly half a standard deviation 
separates European Americans and African Americans. 
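
The Bonferroni adjustment and the conversion between standard deviation units and scale points can be illustrated with a short calculation. The sketch assumes that all pairwise comparisons among the five ethnic groups were tested, which the text does not state explicitly; the variable and function names are ours.

```python
from math import comb

N_GROUPS = 5             # ethnic groups compared
FAMILYWISE_ALPHA = 0.05  # family-wise Type I error rate stated in the text
CLIMATE_SD = 0.63        # standard deviation of the school climate outcome

# Assumption: every pair of groups is compared once.
n_comparisons = comb(N_GROUPS, 2)
per_comparison_alpha = FAMILYWISE_ALPHA / n_comparisons


def sd_units_to_points(sd_units, sd=CLIMATE_SD):
    """Convert a gap expressed in standard deviation units to scale points."""
    return sd_units * sd


parent_gap_points = sd_units_to_points(0.20)  # BA versus no high school diploma
ethnic_gap_points = sd_units_to_points(0.50)  # upper bound for "nearly half" an SD

print(n_comparisons, per_comparison_alpha)
print(round(parent_gap_points, 3), round(ethnic_gap_points, 3))
```

Under that assumption each pairwise test is carried out at a much stricter level than 5 percent, so the four clusters of means reported above survive a conservative criterion.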

Figure 4. Predicted School Disciplinary Climate as a Function of Parent Education and Ethnicity 

Access to High School Algebra 

Figure 5 plots the predicted probability of attending a school that offers 
high school algebra for eighth graders as a function of parental education for 
each of the five major ethnic groups under study. We see that parental education 
is positively associated with the probability of attending such a school. As 
in the case of climate, the near parallelism of the five lines reflects the absence 
of any statistical evidence of a two-way interaction involving parental education 
and ethnicity. Again, we find a significant vertical displacement between 
ethnic groups. Pairwise comparisons using a Bonferroni adjustment to control 
the family-wise Type I error rate at the 5 percent level indicated three separate 
clusters of ethnic group probabilities (in descending order of magnitude): (a) 
Asian Americans; (b) European Americans, African Americans, and Hispanic 
Americans; and (c) Native Americans. The differences manifest in figure 5 are 
comparatively modest in magnitude. 

Figure 5. Predicted Probability of Assignment to a School That Offers Algebra as a Function of Parent Education and Ethnicity 

The regression coefficients for the predictors give the associated partial 
effects in terms of log-odds. Besides computing predicted probabilities based 
on the regression coefficients, one could compute odds ratios as well. For instance, 
the odds ratio of offering algebra for a school attended by a student 
whose parent had college education versus a school attended by a student whose 
parent had less than high school education is exp{d_BA} = exp{-0.244} = 0.784. 
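
The relationship between a log-odds coefficient, an odds ratio, and a predicted probability can be sketched as below. The coefficient is the rounded value quoted in the text; the baseline probability of 0.5 is purely hypothetical, and the helper names are ours. With the rounded coefficient the odds ratio comes out at about 0.78, so the 0.784 reported above presumably reflects an unrounded estimate.

```python
import math

COEFFICIENT_BA = -0.244  # log-odds coefficient as quoted (rounded) in the text


def odds_ratio(log_odds_coefficient):
    """Odds ratio implied by a logistic regression coefficient."""
    return math.exp(log_odds_coefficient)


def probability_from_log_odds(log_odds):
    """Inverse logit: convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def shift_probability(baseline_probability, log_odds_coefficient):
    """Probability after adding a coefficient to the baseline log-odds."""
    baseline_log_odds = math.log(baseline_probability / (1.0 - baseline_probability))
    return probability_from_log_odds(baseline_log_odds + log_odds_coefficient)


ratio = odds_ratio(COEFFICIENT_BA)
# Hypothetical baseline: a subgroup whose probability is 0.5 before the shift.
shifted = shift_probability(0.5, COEFFICIENT_BA)

print(f"odds ratio: {ratio:.3f}")
print(f"probability after shift from 0.5: {shifted:.3f}")
```

Because the model is linear in log-odds, the same coefficient implies different changes in probability at different baselines, which is why the figures plot predicted probabilities rather than coefficients.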

We now turn to two classroom-level resources for learning: teacher sub- 
ject matter preparation, as indicated by having majored in mathematics, and a 
cognitively stimulating environment, as indicated by an instructional empha- 
sis on mathematical reasoning. In both cases, we find that social background 
(as indicated by parental education) and ethnicity are linked to access to the 
resource. However, the findings are more complex than those reported above, 
in that a two-way interaction is manifest in the case of these two classroom- 
level resources. 

Teacher Preparation 

Figure 6 plots the predicted probability of encountering a math teacher 
who majored in math as a function of social background and ethnicity. The 
figure shows that higher levels of parental education are linked to a higher 
probability of encountering such a teacher. However, the magnitude of this 
relationship depends upon ethnicity. The link between social background and 
teacher preparation is strongest for Asian Americans and European Americans 
and weakest for African Americans, Hispanic Americans, and Native 
Americans. Equivalently, we can say that ethnic gaps in access to the resource 
are manifest, but are more pronounced at higher than at lower levels of 
parent education. 

Figure 6. Predicted Probability of Assignment to a Teacher Who Majored in Math as a Function of Parent Education and Ethnicity 



Emphasis on Reasoning 



Figure 7 plots the predicted probability of encountering a math teacher 
who emphasizes mathematical reasoning during instruction. Again there is a 
positive relationship between parent education and this probability, but again 
the magnitude of this association depends upon ethnicity. The link between 
parental education and access to reasoning is strongest for Asian Americans 
and European Americans and weakest for the other three groups. Equivalently, 
just as in the case of teacher preparation, we can say that ethnic gaps in access 
to the resource are manifest, but are more pronounced at higher than at lower 
levels of parent education. 

Figure 7. Predicted Probability of Assignment to a Math Teacher Who Emphasizes Reasoning as a Function of Parent Education and Ethnicity 

Summary 

In sum, we find evidence of ethnic and social inequality in access to all 
four resource indicators when averaging across the 41 states. Main effects of 
both ethnicity and social background generally parallel previous findings in 
predicting student achievement. Thus, just as high parental education predicts 
favorable outcomes, it also predicts access to schools with favorable climates, 
schools that offer algebra, teachers with training in mathematics, and class- 
rooms that emphasize reasoning. Similarly, ethnic groups disadvantaged in 
outcomes (African Americans, Hispanic Americans, and Native Americans) 
also encounter less access to these resources for learning. 

State Variation in Access to Resources 

The pooled, within-state findings regarding social and ethnic inequality 
in access to a favorable school climate provide an “on-average” picture of in- 
equality in access to resources over 41 states. However, these on-average results 
poorly represent the picture that we find in many states. In fact, the data reveal 
substantial evidence of state variation. 

The case of school disciplinary climate illustrates the substantial varia- 
tion across states. Figure 8 plots 95 percent bivariate confidence ellipses for 
the 41 states where the vertical axis is social inequality (as indicated by mean 
gaps in school climate between students having parental education of BA and 
less than high school) and the horizontal axis is ethnic inequality (as indicated 
by mean differences between African Americans and European Americans). 8 
Four features of the scatter plot of ellipses are noteworthy: 

1. First, there is a rather strong negative relationship between parental 
education “gaps” and ethnicity “gaps.” That is, states with a high degree 
of social inequality tend to also exhibit a high degree of ethnic inequality. 
New York is a case in point; lying in the upper left quadrant, New York 
has a “parental education gap” of about 0.30 points (half a standard 
deviation) and an “ethnicity gap” of around 0.60 (a full standard 
deviation). 

2. Some degree of inequality is present in nearly all states. This inference 
is based on noticing that nearly the entire scatter of ellipses lies above 0 
on the vertical axis (indicating positive parental education effects within 
states) and below 0 on the horizontal axis (indicating that African 
American ethnicity is associated with lower levels of disciplinary 
climate). 

3. However, the magnitude of inequality varies quite substantially across 
states. There is a cluster of states near the origin (the point indicating 
equality on both parental education and ethnicity). There are also states 
far from the origin (e.g., New York, New Jersey, California, and 
Massachusetts), implying substantial inequality in access to favorable 
disciplinary climate in these states. 



4. There is considerable overlap among the ellipses, making it hard to 
distinguish many pairs of states and, in fact, making pairwise 
comparisons confusing. However, the ellipses of any pair of states can 
be shaded (as Michigan’s ellipse in figure 8) to facilitate a desired 
pairwise comparison. Using computer graphics, it is easy to highlight 
any subset of states to generate clearer comparisons. 

8 The mean differences associated with social inequality are adjusted for ethnicity, and the 
mean differences associated with ethnicity are adjusted for parent education. The 95 
percent confidence ellipses are based on the empirical Bayes posterior distribution (Morris 
1983) of the parental education and ethnicity coefficients for each state. 

Figure 8. 95 Percent Bivariate Confidence Ellipses for the State-specific Coefficients Associated with Parental Education and African American Ethnicity (Outcome: Mean School Climate) 
Vertical axis: State-specific coefficient for Parent Education = Bachelor Degree (with increasing magnitude associated with increasing advantage of high parental education) 
Horizontal axis: State-specific coefficient for African American Ethnicity (with increasing magnitude associated with increasing African American disadvantage) 

The value of the ellipses is that they automatically communicate the de- 
gree of uncertainty about rankings among states. Consider, for example, 
Michigan and Ohio. Ohio is characterized by significantly greater ethnic in- 
equality than Michigan is, i.e., the gap between European Americans and African 
Americans in the disciplinary climates they encounter is statistically greater in 
Ohio than in Michigan, as indicated by the fact that the two ellipses do not 
overlap on the horizontal axis. However, the two states do not differ in social 
inequality, as indicated by the fact that their ellipses do overlap on the vertical 
axis. 
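
The axis-by-axis reading of the ellipses can be sketched in code. The example below is a simplified illustration under stated assumptions: it treats each state's pair of coefficients as bivariate normal (the paper's ellipses come from an empirical Bayes posterior), and the means and covariances for the two states are invented for the example, not taken from figure 8.

```python
import math


def ellipse_extent(mean, cov, level=0.95):
    """Project a bivariate-normal confidence ellipse onto each axis.

    Returns ((x_low, x_high), (y_low, y_high)). For 2 degrees of freedom the
    chi-square quantile has the closed form -2 * ln(1 - level).
    """
    scale = -2.0 * math.log(1.0 - level)
    half_x = math.sqrt(scale * cov[0][0])
    half_y = math.sqrt(scale * cov[1][1])
    return ((mean[0] - half_x, mean[0] + half_x),
            (mean[1] - half_y, mean[1] + half_y))


def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical states: (ethnicity coefficient, parental education coefficient).
state_a = {"mean": (-0.20, 0.15), "cov": [[0.002, -0.001], [-0.001, 0.003]]}
state_b = {"mean": (-0.50, 0.18), "cov": [[0.002, -0.001], [-0.001, 0.003]]}

extent_a = ellipse_extent(state_a["mean"], state_a["cov"])
extent_b = ellipse_extent(state_b["mean"], state_b["cov"])

ethnic_overlap = intervals_overlap(extent_a[0], extent_b[0])  # horizontal axis
social_overlap = intervals_overlap(extent_a[1], extent_b[1])  # vertical axis

print("overlap on ethnicity axis:", ethnic_overlap)
print("overlap on parental education axis:", social_overlap)
```

Non-overlap of the projections is a conservative visual criterion rather than a formal test of the difference between two states, so it is best read as a screening device for the comparisons the figure is meant to support.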

Excellence versus Equality 

It is also possible to plot “excellence” (high levels of a resource) against 
“equality,” as depicted in figure 9. The figure shows, for example, that New 
Jersey, though displaying a comparatively high degree of ethnic inequality, has 
one of the highest average levels of disciplinary climate. Equality is not a good 
thing if environments are equally bad; South Carolina and Mississippi exhibit 
low levels of inequality but also low average levels of disciplinary climate. 

For the other resources, the pooled results also poorly represent the de- 
gree of inequality in some states. Again, the data reveal substantial evidence of 
state variation. It is possible and generally useful to describe state-to-state variation 
in access to these resources as we did in the case of school climate (figures 
8 and 9). However, a detailed discussion of differences among the 41 states on 
all resources goes beyond the scope of this paper. 

Figure 9. 95 Percent Bivariate Confidence Ellipses for the State-specific Coefficients Associated with Intercept and African American Ethnicity (Outcome: Mean School Climate) 
Vertical axis: State-specific adjusted overall average (with increasing magnitude associated with favorable average levels of disciplinary climate) 
Horizontal axis: State-specific coefficient for African American Ethnicity (with increasing magnitude associated with increasing African American disadvantage) 

Conclusions 

The Trial State Assessment of NAEP reports mean student proficiency in 
a given subject for each of the participating states, broken down by ethnicity 
and parental education (cf. Mullis et al. 1993). Although reports of state means 
are essential as part of an assessment of the condition of education in the U.S., 
we have argued in this paper that such state means, by themselves, are difficult 
to interpret and even misleading. The means reflect an unknown mix of 
contributions from student demographics, school organization and process, and 
state policy. To supplement the reporting of means, we have proposed a 
reporting of the access that states provide to key resources for learning. Knowing 
the extent to which states provide these resources to students of varied 
social background and ethnicity points toward sharply defined policy debates 
concerning ways to improve education. The results of our analysis are both 
substantive and methodological. 

Substantive Findings 

Our results indicate substantial inequality in access to resources, on aver- 
age, over the 41 participating states. Social background, as indicated by levels 
of parental education, is significantly related to access to a school with a favor- 
able disciplinary climate and a school that offers high school algebra for eighth 
graders. Social background also predicts the probability that an eighth grader 
will encounter a teacher who majored in mathematics and a teacher who em- 
phasizes reasoning during mathematics instruction. These effects of social 
background are adjusted for ethnicity. 

The results for ethnicity parallel those for social background, though they 
vary to some degree by the resource of interest. For example, with respect to 
school disciplinary climate, European Americans encounter, on average, the 
most favorable disciplinary climates; Asian Americans and Native Americans 
are next, followed by Hispanic Americans and finally by African Americans. 
The probability of attending a school that offers algebra is distributed a little 
differently: Asian Americans experience the highest probability of attending 
such a school; European Americans, African Americans, and Hispanic Ameri- 
cans are next most likely to attend such a school; and Native Americans have 
the lowest probability of attending such a school. These effects of ethnicity are 
adjusted for social background. The results for teacher preparation and em- 
phasis on reasoning are more complex: ethnic gaps in access are greatest at 
highest levels of parental education, with Asian Americans and European 
Americans having greater access than other groups to each resource. 

In sum, we have found substantial evidence of inequality in access to 
these resources as a function of social background and ethnicity. However, 
there is also substantial variation across states in the extent of inequality. While 
some degree of both forms of inequality appears to exist in nearly all states, 
inequality is much more pronounced in some states than in others. Moreover, 
the overall level of availability of each resource also varies from state to state. 
While a fine-grained analysis of state differences on all four resources would 
be of interest, such a study goes beyond the scope of the current paper. How- 
ever, we have suggested ways in which state differences might be examined. 

The policy implications of these findings vary as a function of the re- 
source in question. Whether a school offers algebra to eighth graders is amenable 
to direct influence by state and district policy. The key impediment to offering 
algebra in a given setting is cost. It is generally more costly for smaller schools 
than for larger schools to diversify their curricula. Similarly, hiring teachers 
with serious college-level preparation in mathematics is under the direct con- 
trol of policy, with cost again being a key impediment. 

Constructing a favorable disciplinary climate, in contrast, is only par- 
tially under the control of policymakers. Effective adult leadership in a school 
setting is arguably the primary ingredient in creating such a climate, though 
the active participation of students and parents is also required for success. 
Skill, knowledge, and commitment are required, and there is considerable un- 
certainty about how to foster the needed efforts. Similarly, a decision to 
emphasize reasoning is in the hands of the teacher, depending on the teacher’s 
knowledge, skills, and evaluation of student needs. Interventions to encourage 
instruction that emphasizes reasoning are currently widespread, but the out- 
comes of such interventions are inevitably uncertain. 

In sum, how information from a report such as ours ought to influence 
the policy debate will vary as a function of the kind of resource in question. 
Options for increasing access to certain resources must be evaluated in terms 
of cost and feasibility. Our primary point, however, is that systematically col- 
lected data on access to key resources, as a supplement to reports of mean 
proficiency, ought to constitute an important input into policy debates regard- 
ing educational reform. 

Methodological Implications 

The educational resources considered here clearly constitute a small sub- 
set of those that ought to be studied. We have reasoned that the resources of 
key interest are those suggested by prior theory and research and operationalized 
in NAEP. There should also be some evidence that the NAEP indicator of the 
resource relates as expected to key educational outcomes. The logic of this 
argument is to extend NAEP to include a wider range of possible resources 
than are now included and to take some pains to ensure that the resource indicators 
achieve a modicum of construct validity. For example, it would be 
extremely useful to field-test and validate student reports of multiple indica- 
tors of student social background including parental occupation, and to construct 
and validate a scale for cognitive stimulation in the classroom based on student 
reports. Linking NAEP data to indicators of neighborhood demographic char- 
acteristics such as poverty concentration, housing density, and ethnic 
composition would strengthen inference by allowing control for residential 
context. And it would be exciting to include with NAEP a survey of teachers in 
order to construct school-level indicators, based on teacher reports, of norma- 
tive cohesion, expectations, collaboration, control, opportunities for learning, 
and school-level academic press. The availability of denser data at the level of 
the student, classroom, and school would provide a wider range of school re- 
sources than can now be studied, leading to a richer characterization of the 
association between student background and access to resources. 

A promising avenue for future research is to develop more sophisticated 
models to explain variation in access to key resources. School district wealth, 
urban versus suburban versus rural location, school size, per pupil expendi- 
tures, and school social composition may shape the probability that resources 
will become available to a student; and studying such predictors may shed 
light on impediments to increasing access and identify new targets for inter- 
vention by policy. Our broad recommendation is that, as we assess student 
progress in subject-matter proficiency, we also assess the extent to which the 
education system provides resources that support such student progress. 

Educational research has been characterized, perhaps unfairly in recent 
years, by the inconsistency of its research results and by a lack of consensus 
across its broad and multidisciplinary research community (Wilson and Davis 
1994; Saranson 1990). The broad purpose of this conference is to help deter- 
mine how we can improve the consistency and accuracy of results in educational 
research so that we can build a base of knowledge widely accepted by this 
diverse research community and, more importantly, by teachers, principals, 
superintendents, and policymakers. To do so is a daunting task since education 
is one of the most complex topics addressed by social science. It is not surpris- 
ing that progress in this direction has been slow, given both the broad 
interdisciplinary basis and the inherent complexity of learning. 

We have proceeded with the hope that better nonexperimental data and 
more sophisticated model specifications and estimation techniques will even- 
tually bring consensus. In this paper we will suggest that simply improving the 
kinds of nonexperimental data currently collected, along with the associated 
statistical methodologies, will never be sufficient to achieve the kind of scien- 
tific consensus needed to effectively guide educational policies. 1 Research shows 
that the effects we are trying to measure are quite complex. They often appear 
to be nonlinear, sensitive to contextual factors, moderately correlated among 
themselves, and subject to selection bias within families and schools. Moreover, 
some achievement effects are long-term, sustained long after an intervention 
has stopped; and some fade after a few years. These results may be 
only the tip of the iceberg, considering the complexity of the underlying developmental 
phenomena we are trying to understand. 

1 Support for this work came from the Center for Research on Educational Diversity and 
Excellence (CREDE), the NAEP Redesign Research Program, the NAEP Secondary 
Analysis Program, and Exxon Corporation. 

This complexity places great demand on the quality of our data, the so- 
phistication of our model specifications, and the accuracy of our estimation 
techniques. One interpretation of the wide variation in measurements of the 
effects of most factors affecting student achievement is that our data, model 
specifications, and estimation techniques do not yet reflect much of this inher- 
ent complexity. When results vary, it is difficult to determine why one set of 
results should be trusted over another, since practically every measurement 
makes different assumptions or uses different model specifications and esti- 
mation techniques. The wide variety of data quality, assumptions, and 
specifications may introduce enough bias and randomness to produce incon- 
sistent effects across different data sets and model specifications. In this case 
the results should not be interpreted as “no effect,” but rather as inconclusive. 

We suggest that three research approaches will be necessary to lead reli- 
ably to research consensus: increasing experimentation, building theories of 
educational process, and improving our nonexperimental analysis. Further, we 
believe that future data collection and research should be guided by a strategic 
plan built upon experimentation. Such a plan would provide the necessary data 
to build theories of educational process and improve our specifications of models 
used in nonexperimental analysis. 

Experiments — if well designed, implemented, analyzed, and replicated — 
provide explanations that are as close to causal as possible in social science. 
Such experiments can provide the most accurate results for the effect of a par- 
ticular variable in a given context. Experiments can also play another, and 
perhaps more important, role in social science research — namely, helping to 
validate model specifications for nonexperimental data. A key theme of this 
paper is that future experimentation and data collection need to be directed 
toward both the building of theories and the improvement of our assumptions 
in analyzing nonexperimental data. In the long run, policy analyses will largely 
be dependent on improving nonexperimental analysis since experiments can 
never be counted on to solve all the complex and contextual effects present in 
education. 2 Therefore, improving our confidence in the model specifications 
used with nonexperimental data is critical. 

The major thrust of this paper is to suggest that building scientific 
consensus will require a coherent research strategy. This strategy must be 
built upon increasing experimentation, developing theories of educational 
process, and improving confidence in nonexperimental analysis, if we are to 
achieve research consensus. In this paper we focus initially on the broad lack 
of agreement relating to the effects of educational resources and social and 
educational policy on children. Thirty years of research with nonexperimental 
data have led to almost no consensus on these important policy issues. We 
then focus on a narrower question, namely, the impact of resources on 
educational outcomes, particularly student achievement. This situation 
presents an interesting case study where a consensus based on the results of 
nonexperimental data once existed, only to be challenged recently by new 
experimental and nonexperimental research. 

We use the Tennessee class size experiment results to illustrate the pro- 
cess of deriving “rules” for model specification used in nonexperimental data 
involving class size. We then illustrate the process of building theories of edu- 
cational process related to class size effects and describe the role of such theories 
in building stronger consensus. Finally, we specifically focus on implications 
for the National Assessment of Educational Progress (NAEP) and other data 
collections and more generally suggest directions for future research and de- 
velopment (R&D) efforts to build a more solid foundation of knowledge for 
educational policymaking. 

Children’s Well-Being: The Ongoing Debate 

Federal, state, and local governments spend approximately $500 billion 
per year in social, educational, and criminal justice expenditures on the nation’s 
children and youth (Office of Science and Technology Policy 1997). 3 The 
amount spent on children appears to have increased substantially over time 
(Fuchs and Reklis 1992), although there is debate about the magnitude of the 
real increase in spending. Thus, an important set of public policy questions is 
associated with how effective this increased spending has been at improving 
the well-being of our children. Besides increased investment, there have been 
significant changes in families, communities, and schools that would be expected 
to affect children’s outcomes. 

2 Large-scale experiments such as the Tennessee class size experiment can be costly and 
take considerable time to plan, implement, and analyze. While more experimentation 
seems essential to making progress in educational research, educational research will 
probably never follow health research, where trials are needed for every new intervention 
before implementation. 

There is little scholarly consensus about the effects of expenditures on 
children or the effects from changing families, communities, and schools. For 
instance, scholars disagree about the impact of the War on Poverty and ex-
panded social welfare programs (Herrnstein and Murray 1994; Jencks 1992);
they also disagree on whether increased school resources have raised student 
achievement levels (Burtless 1996; Ladd 1996a). There is disagreement about 
the way communities have changed for black families (Wilson 1987; Jencks 
1992) and whether the net effect on children of recent changes in the family 
has been positive or negative (Cherlin 1988; Zill and Rogers 1988; Fuchs and
Reklis 1992; Popenoe 1993; Stacey 1993; Haveman and Wolfe 1994, 1995;
Grissmer et al. 1994). There is more agreement about the effects of desegrega- 
tion, although some dispute remains (Wells and Crain 1994; Schofield 1995; 
Armor 1995; Orfield and Eaton 1996). Finally, many small-scale, intensive 
early childhood programs appear to produce significant short- and long-term
effects, but there is disagreement about large-scale programs — how large the 
effects from attending kindergarten and preschool are and how long these ef- 
fects last (Barnett 1995; Karweit 1989). Recent evidence suggests that the 
cost-effectiveness of early childhood programs can depend critically on the 
characteristics of the targeted group, with significant net fiscal returns for some 
groups, but not others (Karoly et al. 1998). 



3 This estimate does not include the foregone taxes for deductions for children and day
care. Besides public sector spending on children, approximately $560 billion is spent in
the private sector on children, bringing the average public and private spending per child
to approximately $15,000 annually. This amount is estimated assuming the cost of
raising a child to age 18 to be approximately $150,000, with approximately 70 million
individuals between the ages 0-18. Thus, annual expenditures are $150,000 x
70,000,000/18 = $560 billion. See United States Department of Agriculture (1997) for
estimates of the cost of raising children.
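
The arithmetic in the note above can be checked directly. The short Python sketch below uses only the rounded figures quoted in the note; the variable and function names are ours and purely illustrative. With exactly 70 million children, the private total comes out somewhat above the quoted $560 billion, which suggests that the original calculation may have used slightly different, unrounded inputs.

```python
# Rounded inputs quoted in the note; names are illustrative assumptions.
COST_TO_AGE_18 = 150_000          # dollars to raise one child to age 18
CHILDREN = 70_000_000             # individuals aged 0-18
YEARS = 18                        # years over which the cost is spread
PUBLIC_SPENDING = 500e9           # dollars per year, public sector
PRIVATE_SPENDING_QUOTED = 560e9   # dollars per year, as quoted in the note


def annual_private_spending(cost_to_18, children, years):
    """Annual private spending implied by a per-child lifetime cost."""
    return cost_to_18 * children / years


def spending_per_child(public, private, children):
    """Average combined public and private spending per child per year."""
    return (public + private) / children


private_from_inputs = annual_private_spending(COST_TO_AGE_18, CHILDREN, YEARS)
per_child = spending_per_child(PUBLIC_SPENDING, PRIVATE_SPENDING_QUOTED, CHILDREN)

print(f"private spending from rounded inputs: ${private_from_inputs / 1e9:.0f} billion")
print(f"public plus private spending per child: ${per_child:,.0f}")
```

The per-child figure uses the quoted $560 billion rather than the recomputed total, so that it matches the "approximately $15,000" stated in the note.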
Despite the lack of consensus among the educational research commu-
nity, dramatic changes are being proposed and are occurring in both social and
educational policies, based on perceptions that past policies have failed. For 
instance, much of the movement toward more fundamental reform of public 
schools arises from perceptions that massive increases in resources in grades 
K-12 education over the last 25 years have resulted in declining — or at best 
stable — student achievement (as measured by scores on the Scholastic Achieve- 
ment Test [SAT] and NAEP scores) and that schools have particularly failed 
minority students. If so, a solid case could be made for restructuring school 
governance and incentive structures so that more effective utilization of re- 
sources might possibly occur (Hanushek 1994; Hanushek and Jorgenson 1996). 
However, new research is challenging this once widely accepted conclusion. 



Until the early to mid-1990s, the dominant research position among so- 
cial scientists was that school resources had little impact on student achievement. 
This counterintuitive view dated from the “Coleman report” (Coleman et al. 
1966). Influential reviews by Eric Hanushek (1989, 1994, 1996, 1999) also 
argued that evidence from over 300 empirical measurements provided no con- 
sistent evidence that increases in school resources raised achievement scores. 
It was suggested that a key reason for inefficiency in public schools was a lack 
of incentives (Hanushek and Jorgenson 1996). 

However, it would not be surprising that some money was spent ineffi- 
ciently, given that no definitive results emerged from educational research that 
could guide policymakers. At worst — if past resources can be shown to have 
had no effect on achievement — this finding can simply indicate the lack of 
guidance by good R&D. The lack of a critical level of R&D funding and criti- 
cal mass of high quality research may provide an explanation for inefficiency 
just as persuasive as the lack of incentives (Wilson and Davis 1994). 



4 The early sections of this paper draw heavily from four recent papers: Grissmer,
Flanagan, and Williamson (1998a); Grissmer et al. (1998); Grissmer, Flanagan, and
Williamson (1998b); and Grissmer et al. (forthcoming). We have quoted liberally from
Hanushek’s original reviews did not group studies using the quality of
data and specifications, type of intervention, or student or grade level (1989, 
1994). However, Hanushek refined his reviews, focusing on effects from per 
pupil expenditure and pupil/teacher ratio reductions and disaggregating stud- 
ies by grade level, level of aggregation, and model specifications (Hanushek 
1996, 1999). These later reviews still indicated that subsets of studies provide 
positive and negative coefficients in about equal numbers. One focus was on 
studies using a production function framework where the previous year’s test 
scores were used as controls. These models were judged by many to be the 
most likely to avoid bias. These models also showed balanced numbers of 
positive and negative coefficients. These results strengthened the conclusion 
that the nonexperimental evidence supported little effect from class size reduc- 
tions or additional expenditures. 

Subsequent literature reviews questioned the selection criteria used in 
Hanushek’s reviews to choose studies for inclusion and the assignment of equal 
weight to all measurements from the included studies. Two subsequent litera- 
ture reviews (Hedges, Laine, and Greenwald 1994; Krueger 1999a) used the 
same studies included in Hanushek’s reviews, but came to different conclu- 
sions. One study used meta-analytic statistical techniques for combining the 
measurements, which do not weigh each measurement equally (Hedges, Laine, 
and Greenwald 1994). Explicit statistical tests were made for several variables 
for the hypotheses that the results support a mean positive coefficient and re-
ject a mean negative coefficient. The analysis concluded that, for most resource
variables, the evidence supported a positive relationship between resources and
outcomes. In particular, per pupil expenditures and teacher experience pro-
vided the most consistent positive effects, with pupil/teacher ratio, teacher salary,
and teacher education having much weaker effects.
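
To illustrate what it means to combine measurements without weighting them equally, the sketch below computes an inverse-variance weighted mean, one common meta-analytic technique in which more precise estimates count for more. It is not necessarily the procedure used in the review cited above, and the coefficients and standard errors are hypothetical values chosen only for illustration.

```python
import math


def inverse_variance_mean(estimates, std_errors):
    """Pool estimates, weighting each by 1 / (standard error squared)."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Hypothetical coefficients and standard errors (not taken from any study).
coefs = [0.10, -0.05, 0.20, 0.02]
ses = [0.05, 0.20, 0.10, 0.04]

pooled, pooled_se = inverse_variance_mean(coefs, ses)
simple_mean = sum(coefs) / len(coefs)

print(f"equal-weight mean:       {simple_mean:.4f}")
print(f"inverse-variance mean:   {pooled:.4f} (s.e. {pooled_se:.4f})")
```

The imprecise negative estimate pulls the equal-weight mean as hard as any other measurement, but carries little weight in the pooled estimate.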



A more recent literature review using the same studies included in 
Hanushek’s reviews also concludes that a positive relationship exists between 
resources and outcomes (Krueger 1999a). This review criticizes the inclusion 
and equal weighting of multiple measurements from single published studies. 
Some studies provided as many as 24 separate measurements due to the pre- 
sentation of sets of results for many subgroups. Since the average sample size 
will decline as subgroups increase, many of the measurements lacked the sta- 
tistical power to detect policy-significant effects; and thus many insignificant 
coefficients might be expected. Since the presentation of results for subgroups 
is not done uniformly across studies, and may even be dependent on the results
obtained, Krueger (1999a) reanalyzes the data to determine if the inclusion of
multiple measurements significantly affects the conclusions reached. His analy- 
sis concludes that the inclusion of multiple measurements is a significant factor 
in explaining the original conclusions, and that less weight placed on these 
multiple measurements would lead to support for a positive relationship be- 
tween higher per pupil expenditures and lower pupil/teacher ratio and outcomes. 
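
The effect of weighting can be seen in a small sketch. One simple way to place less weight on multiple measurements from a single study is to give each study, rather than each measurement, one vote. The Python example below compares the two tallies. The study labels and coefficients are hypothetical, and the code illustrates the idea only; it does not reproduce Krueger's analysis.

```python
from collections import defaultdict


def share_positive_by_estimate(estimates):
    """Fraction of positive coefficients, each measurement counted once."""
    signs = [coef > 0 for _, coef in estimates]
    return sum(signs) / len(signs)


def share_positive_by_study(estimates):
    """Fraction of positive coefficients, each study given equal weight."""
    by_study = defaultdict(list)
    for study, coef in estimates:
        by_study[study].append(coef > 0)
    study_shares = [sum(flags) / len(flags) for flags in by_study.values()]
    return sum(study_shares) / len(study_shares)


# Hypothetical data: three studies report one estimate each, while
# study "D" reports six subgroup estimates, most of them negative.
estimates = [("A", 0.3), ("B", 0.2), ("C", 0.1)]
estimates += [("D", c) for c in (-0.1, -0.2, 0.1, -0.3, -0.1, -0.2)]

by_estimate = share_positive_by_estimate(estimates)
by_study = share_positive_by_study(estimates)

print(f"share positive, one vote per measurement: {by_estimate:.2f}")
print(f"share positive, one vote per study:       {by_study:.2f}")
```

In this made-up case a single study with many subgroup estimates dominates the per-measurement tally, which is the pattern the reanalysis was designed to detect.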

A more comprehensive review of the literature prior to 1990 used meta- 
analytic statistical comparison techniques, but searched a wider literature and 
imposed different quality controls (Greenwald, Hedges, and Laine 1996). All 
the included studies used achievement as the dependent variable and measure-
ments at the individual or school level only. The resulting set of measurements
included many that were not in Hanushek’s reviews and excluded about
two-thirds of the measurements that Hanushek’s reviews had included.

The analysis of the set of coefficients for six variables (per
pupil expenditure, teacher ability, teacher education, teacher experience, pu-
pil/teacher ratio, school size) statistically supported the hypothesis that the
median coefficients from previous studies showed positive relationships be-
tween resource variables and achievement. However, the variance in coefficients
for each variable across studies was very large. Extreme outliers appeared to
be a problem for some variables, and the coefficients across studies appeared
to have little central tendency, indicating the presence of nonrandom errors.

This review also reported results for measurements using different model
specifications (longitudinal, quasi-longitudinal, and cross-sectional). 5 The re-
sults showed that median coefficients changed dramatically for most variables
across specifications, with no recognizable pattern. Although few studies had
what were considered superior specifications (longitudinal studies),
the median coefficients for these models were negative for per pupil expendi-
ture, teacher education, pupil/teacher ratio, and school size. When the median
coefficients of studies having quasi-longitudinal specifications were compared to co-
efficients from the entire sample, results were similar for four variables, but
differed for the remaining two variables by factors ranging from 2 to 20. In the



5 Longitudinal studies were defined as those having a pretest control score, and quasi-longitudi-
nal was defined as having some earlier performance-based measure as a control. Cross-
sectional studies merely had SES-type variables included as controls.
case of teacher salary, these studies provided a median coefficient indicating
that a $1,000 salary increase could boost achievement by over one-half stan-
dard deviation.

This review utilized better screening criteria and better statistical tests to 
conclude that the overall evidence supported positive effects from additional 
resources. However, the large variance in coefficients and the sensitivity of the 
median coefficients to which studies were included provided little confidence 
that the literature could be used to estimate reliable coefficients. In particular, 
models thought to have superior specifications provided no more consistent 
results and sometimes provided noncredible estimates. 

Besides the argument from literature reviews, Hanushek made another 
argument that seemed consistent with his conclusions. Measured in constant 
dollars, expenditures per pupil doubled between the late 1960s and the early 
1990s; however, NAEP scores at ages 9, 13, and 17 showed no dramatic im-
provement in average reading or math skills during this period. We address
this argument next. 

Interpreting NAEP Score Trends 

Achievement scores are a particularly good measure of the changing en- 
vironment for our children since research has shown that achievement reflects 
the combined influence of families, communities, and schools. Significant 
changes in the quality of our families, schools, and communities should be 
reflected in achievement trends that are best measured by NAEP (Campbell et
al. 1996; Miller, Nelson, and Naifeh 1995; Mullis et al. 1993; Reese et 
al. 1997). 

The NAEP achievement scores collected from 9-, 13-, and 17-year-olds 
since 1969 are the only nationally representative achievement scores available. 
The primary purpose of NAEP has been to simply monitor the achievement of 
American students; however, NAEP scores are increasingly being used to evalu- 
ate the effects on youth from the dramatic changes in families, communities, 
and schools, and from our nation’s educational and social policies — changes 
that have taken place since the late 1960s. These changes include the follow- 
ing: 
• National efforts to equalize opportunity and reduce poverty that began
in the mid-1960s and continued or expanded in subsequent decades.
These efforts included federally funded preschools (e.g., Head Start),
compensatory funding of elementary schools with large numbers of low-
income students, desegregation of schools, affirmative action in college
and professional school admissions, and expanded social welfare
programs for poor families.

• Changes in school attendance and school changes that were not primarily
designed to equalize opportunity. These changes included increased early
schooling, greater per pupil expenditures, smaller classes, significant
changes in the characteristics of teachers, and systemic reform initiatives.

• Changes in families and communities that may have been somewhat
influenced by efforts to equalize opportunity and reduce poverty but
that occurred mainly for other reasons. Specifically, parents acquired
more formal education, more children lived with only one parent, more
children had only one or two siblings, and the proportion of children
living in poverty rose. At the same time, poor blacks concentrated more
in inner cities, while the more affluent blacks moved to the suburbs.

The 17-year-olds tested by NAEP in 1971 would have grown up in fami- 
lies and communities and attended schools largely unaffected by the changes 
cited above. However, those recently tested would have lived their entire lives 
in families, communities, and schools reshaped by these policies. It would be 
hard to take a position about the quality of our families, communities, and 
schools and the effectiveness of social and educational policies that would be 
inconsistent with the trends in the NAEP data. 

Until recently, the NAEP scores were used only peripherally to address 
these kinds of questions, partly because the more widely recognized (but fa- 
tally flawed) SAT scores were used whenever test scores entered the public 
debate. One reason that SAT scores are used effectively in public debate is that 
the public appears to base its assessment of the quality of American schools on 
SAT scores (Grissmer forthcoming). Figure 1 shows the results of an annual 
public opinion poll that asks adults to grade the nation’s schools. The percent- 
age of adults giving schools an “A” or a “B” is graphed against changes in 
annual average SAT scores. 6 The data show that public opinion appears to fol-
low the SAT trends.
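
The kind of fit reported in the note to Figure 1 (school grade = a + b x average SAT score, with both series centered on a mean of zero) can be sketched as follows. The two series below are invented placeholders, not the poll or SAT data, and the function names are ours; the sketch only shows how a slope and R-squared of this kind are obtained by ordinary least squares.

```python
def center(values):
    """Shift a series so that its mean is zero."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def ols_fit(x, y):
    """Least-squares fit of y = a + b * x; returns (a, b, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return a, b, 1.0 - ss_res / ss_tot


# Placeholder series, invented for illustration only.
sat_score = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
pct_grading_a_or_b = [-2.5, -1.0, -1.5, 0.5, 0.0, 2.0, 2.5]

a, b, r_squared = ols_fit(center(sat_score), center(pct_grading_a_or_b))
print(f"a = {a:.2f}, b = {b:.2f}, R-squared = {r_squared:.2f}")
```

Because both series are centered, the intercept a is zero by construction and only the slope b and the R-squared carry information.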

The well-known flaws in the SAT scores for monitoring national achieve- 
ment trends result from their self-selected sample (Advisory Group on the 
Scholastic Aptitude Test Score Decline 1977; Koretz 1986, 1987; Rock 1987; 
Grissmer et al. 1994). The scores are biased downward, not only because of an 
increasing percentage of students taking the test but also because the students 
making the largest achievement gains from 1970 to 1990 — minority and dis- 
advantaged students — are largely missed by the SAT because they do not go to 
college. Ironically, if K-12 education improves, allowing more children to at- 
tend college, the SAT scores will decline. Thus, SAT scores are probably a 
perverse indicator of K-12 school quality. 

The research community switched to analyzing NAEP data in isolated 
studies dating from the mid-1980s. A steady stream of analyses from the late 
1980s drawn from the NAEP data developed into more detailed analyses using 



Figure 1. Comparing the trends in SAT scores with percentage of
adults giving schools a grade of “A” or “B”

[Figure not reproduced. Series plotted by school year: SAT score and percentage grading schools A or B, each normalized to a mean of zero.]



6 The graph normalizes both variables to a mean of 0. The regression fit for the equation,
school grade = a + b (Average SAT score), gives b = .79 (t = 5.2), R-Squared = .56.
new methodologies from the mid-1990s. 7 Early work took note of the large
gains in black scores and the very small gains in white scores, along with the
resulting convergence of the black-white test score gap. The contrast with fall-
ing SAT scores was noted. However, familiarity with this earlier
work — buttressed by the National Research Council (1989) — seemed to re- 
main confined to a small group of researchers, and declining SAT scores 
remained the dominant influence among both the public and the research com- 
munity. 8 

Starting in the early 1990s, analyses of the NAEP data began to provide 
more detail about differences in trends among black, Hispanic, and white 
students; differences in trends for lower- and higher-scoring students; differ- 
ences by age; and particularly differences by entry cohorts. The analyses 
also attempted to explain the trends and the convergence in the black-white 
test score gap. 

Across ages and subjects, the largest gains in scores occurred for black 
students; but significant gains were registered by Hispanic students and lower- 
scoring white students, with small gains or none registered by average and 
higher-scoring white students (Hedges and Nowell 1998; Hauser 1998; 
Grissmer et al. 1994, 1998; Grissmer, Flanagan, and Williamson 1998a). These
studies also noted the evidence that black gains were largely confined to a 
group of about 10 cohorts born in the mid-1960s to the mid-1970s and enter- 
ing school around 1970 to 1980. For later cohorts, black scores and the 
black-white achievement gap have — for most age groups and subjects — re- 
mained stable or declined. 

The most striking feature of the NAEP results for blacks is the size of 
adolescents’ gains for cohorts entering from 1968-1972 to 1976-1980. These 



7 See Hauser (1998) for a history of utilizing NAEP scores from 1984 to 1992. This period 
included work by Jones (1984); Koretz (1986, 1987); National Research Council (1989); 
Linn and Dunbar (1990); and Smith and O’Day (1991). See Rothstein (1998) for a long- 
term history of achievement that extends through 1997. This paper draws from all of 
these studies. 

8 This phenomenon points to a second problem in attaining consensus in the educational 
research community. While small groups of researchers with in-depth knowledge in a 
subject may find consensus, it is quite another problem for this information to be 
disseminated, accepted broadly, and commonly cited in most research. The diverse set of 
journals and disciplinary boundaries make it difficult for narrow consensus to become
broad consensus.
gains were 0.6 standard deviation averaged across reading and math. Such
large gains for very large national populations over such short time periods are
rare, if not unprecedented. Scores on IQ tests given to national populations
seem to have increased gradually and persistently throughout the 20th century,
both in the United States and elsewhere (Flynn 1987; Neisser 1998). But no 
evidence exists in these data involving large populations showing gains even 
close to the magnitude of the gains made by black student cohorts over a 10- 
year period. 

Even in intensive programs explicitly aimed at raising test scores, it is 
unusual to obtain gains of this magnitude. Early childhood interventions are 
widely thought to have the largest potential effect on academic achievement, 
partly because of their influence on brain development. Yet only a handful of 
“model” programs have reported gains as large as half a standard deviation 
(Barnett 1995). These programs were very small-scale programs with inten- 
sive levels of intervention. Even when early childhood programs produce 
large initial gains, the effects usually fade at later ages. Among blacks who 
entered school between roughly 1968 and 1978, in contrast, the gains were 
very large among older students and were not confined to small samples, but 
occurred nationwide. 

Beginning in the mid-1990s, finding the likely causes of these gains be- 
came the focus of research. Part of the quest was to determine whether the 
dramatic changes that occurred in families during this period could explain the 
gains. Utilizing data from several sources (Current Population Survey [CPS], 
the National Longitudinal Survey of Youth [NLSY], and the National Educa- 
tion Longitudinal Study [NELS]), one study developed a new methodology to 
estimate the size of the net expected gains from changes in eight key family 
characteristics for 13- to 17-year-old test-takers from 1970-90 (Grissmer et al. 
1994). The analysis required several assumptions — one concerning the stabil- 
ity of family coefficients in achievement equations over time. 9 The results of 
the analysis indicated that changes in the family would predict small positive 
gains in scores for all racial-ethnic groups and that these gains could account 
for the smaller score gains among whites but could explain only about one- 
quarter of the minority gains. 



9 Evidence from Hedges and Nowell (1998) and Cook and Evans (1997) appears to support
fairly stable family coefficients over time.
Another analysis using NAEP individual level data also concluded that
family effects could account for only one-quarter or so of black gains (Cook 
and Evans 1997). This analysis relied on student-reported family characteris- 
tics collected with the NAEP, but utilized a methodology newly imported from 
labor economics to attempt to partition the gains into those related to family 
changes, changes in family structural characteristics, and those due to changes 
between and within schools. If effects from changing family characteristics 
are small, the likely remaining hypothesis for the black score gains is school- 
related, community-related, or related to yet unmeasured family characteristics. 

Jencks and Phillips (1998) summarized research efforts focusing on the 
black-white test score gap. Their book brought together a diverse set of schol- 
ars to try to determine where consensus can be achieved on this topic and 
where and what kind of additional research is needed. 10 Three analyses re- 
ported in the book look at the convergence and possible divergence of the 
black-white score gap for cohorts born as early as 1950 (Hedges and Nowell
1998; Phillips, Crouse, and Ralph 1998; Grissmer, Flanagan, and Williamson 
1998a). Two of the studies utilize NAEP data as well as achievement and sur- 
vey data from other studies. All agree that significant narrowing occurred for 
cohorts born prior to about 1978, but no further narrowing occurred for later
cohorts. 

Although the black-white gap for reading actually widened, Phillips, 
Crouse, and Ralph (1998) concluded that the widening is not statistically sig- 
nificant. Hedges and Nowell (1998) and Grissmer, Flanagan, and Williamson 
(1998a) provided evidence that family changes may explain a part of the nar- 
rowing. Further, Grissmer, Flanagan, and Williamson (1998a) observed that 
the timing of the black gains by age group and region suggested two major 
hypotheses for the gains. The first hypothesis was based on changes in school- 
ing — changing pupil/teacher ratios and class sizes, changing teacher 
characteristics, and changing curricula. Changing pupil/teacher ratios emerged 



10 In the process of achieving consensus, support for a continuing series of books dedicated 
entirely to exploring the most important questions in education seems crucial. Besides 
Jencks and Phillips (1998), Ladd (1996a) and Burtless (1996) are also good examples. In 
these latter books, the consensus might be characterized more by what is not known than 
what is known.
as a viable, but not completely satisfactory, explanation in other analyses
(Krueger 1998; Ferguson 1998). 11 

A second explanation emerged, more closely related to the changes en- 
gendered by the Civil Rights movement and the War on Poverty. Such changes 
could have direct effects related to school desegregation — particularly in the 
South — and indirect effects caused by the perceived shift in the motivation for 
and attitudes toward education of black parents and students stemming from 
better opportunities for future schooling and jobs. An additional possible shift 
from these efforts could have occurred in the behavior and attitudes of teachers 
of black students that resulted in increased attention and resources. The timing 
of the black gains by age coincides with the broad-scale implementation of 
such efforts, if the assumption is made that most of the effects would occur 
only if students experienced these changes from the early grades forward. The 
large gains for minority and disadvantaged students, as well as the smaller 
gains (or lack of gain) among average and higher-scoring white students, pose 
a challenge to the thesis that the increased spending in education and social 
programs aimed at these students was ineffective. 

Analysis of NAEP scores appears to be central to the debates about changes 
in American families and schools, policies providing equal opportunity in edu- 
cation, and the best way to spend investments in education and children. The 
effective absence of these scores from these national debates has allowed many 
widespread beliefs to proliferate that seem to be at odds with the NAEP re- 
sults. The NAEP data do not suggest that families have deteriorated since 1970. 
Nor do they suggest that schools have spent money inefficiently or that social 
and educational policies aimed at helping minorities have failed. 

Instead, they suggest that family environments changed in positive ways 
from 1970 to 1996, that the implementation of the policies associated with the 
Civil Rights movement and the War on Poverty may be a viable explanation 
for large gains in black scores, and that certain changes in our schools and 
curriculum are consistent with NAEP score gains. While the NAEP scores 



11 The timing of pupil/teacher ratio changes would suggest that score gains should have 
started earlier and would affect white scores as well — leading to overpredicted white 
gains. Further research to determine whether class size for black students fell more than 
for white students might help reduce the overprediction of white score gains. This 
overprediction would also be addressed if class size reductions were small or nonexistent
for advantaged white students.
alone cannot reject the beliefs about deteriorating families and schools and the
ineffectiveness of social and educational policies, the advocates of such be-
liefs must provide an explanation for NAEP scores consistent with their 
positions. The NAEP scores from 1971 to 1988 generally support a more posi- 
tive picture of our families, schools, and public policies; however, trends in 
black achievement since 1988 to 1990 have been more discouraging, and it is 
critical to understand why these reversals have occurred. 



Trends in Resources



Research on NAEP scores shows that the increases were negligible only 
for the higher-scoring white population, but substantial for black, Hispanic, 
and lower-scoring white students. A second line of research using new data
and new methods of estimating “real” per pupil expenditures over time shows 
that resource growth tended to occur where achievement gains were made 
(Rothstein and Miles 1995). 

A new method of deflating school expenditures, taking account of the 
labor intensity of schools, showed that resources did not come close to dou- 
bling as had been indicated by the commonly used Consumer Price Index (CPI). 
Use of more appropriate indices for adjustment of educational expenditures 
reflecting their labor intensity provides much lower estimates of real growth 
(Rothstein and Miles 1995; Ladd 1996b). 
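
The sensitivity of "real" growth to the choice of deflator can be shown with a short sketch. Real growth is the ratio of nominal spending growth to price growth, minus one, so an index that rises faster than the CPI (as an index reflecting labor-intensive services would) yields lower real growth from the same nominal figures. All ratios below are hypothetical values chosen for illustration; they are not the estimates of Rothstein and Miles or of Ladd.

```python
def real_growth(nominal_ratio, price_index_ratio):
    """Fractional real growth from end/start ratios of spending and prices."""
    return nominal_ratio / price_index_ratio - 1.0


# Hypothetical end-of-period / start-of-period ratios.
NOMINAL_SPENDING_RATIO = 6.0   # nominal per pupil spending
CPI_RATIO = 3.0                # general consumer price index
LABOR_INTENSIVE_RATIO = 3.75   # index weighted toward labor-intensive services

growth_cpi = real_growth(NOMINAL_SPENDING_RATIO, CPI_RATIO)
growth_labor = real_growth(NOMINAL_SPENDING_RATIO, LABOR_INTENSIVE_RATIO)

print(f"real growth, CPI deflator:             {growth_cpi:.0%}")
print(f"real growth, labor-intensive deflator: {growth_labor:.0%}")
```

With these made-up inputs the CPI implies a doubling of real resources, while the faster-rising index implies considerably less, which is the direction of the adjustment described above.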

Moreover, the new method — developed to assign school expenditures to 
programmatic categories that could distinguish spending on different types of 
students — showed that even this smaller increase overestimates the additional 
resources available to boost achievement scores for regular students. A large 
part of the smaller estimated increase went for students with learning disabili- 
ties, many of whom are not tested. 12 Another part also went for other socially 



12 There is agreement that a disproportionate fraction of the expenditure increase during the
NAEP period was directed toward special education (Lankford and Wyckoff 1996;
Hanushek and Rivkin 1997). Hanushek and Rivkin estimated that about a third of the
increase between 1980 and 1990 was related to special education. NAEP typically
excludes about 5 percent of students who have serious learning disabilities. However,
special education counts increased from about 8 percent of all students in 1976-77 to
about 12 percent in 1993-94. These figures imply that 7 percent of students taking the
NAEP tests were receiving special education resources in 1994, compared to 3 percent in
1976-77. This percentage is too small to have much effect on NAEP trends, but it should
in principle have had a small positive effect.
desirable objectives that are only indirectly related to academic achievement.
Taking into account better cost indices, and including only the spending that 
would have been directed at increasing achievement scores, Rothstein and Miles 
(1995) concluded that the real increase in per pupil spending on regular stu- 
dents was closer to 30 than to 100 percent. 

These smaller additional expenditures for regular students are mainly 
accounted for by lower pupil/teacher ratios, increased teacher salaries due to 
more experienced and educated teachers, and compensatory programs that 
would be expected to benefit minority and lower income students (Rothstein 
and Miles 1995; Hanushek and Rivkin 1997). The key issue then becomes 
whether these resource increases can plausibly explain any part of the pattern 
of large black gains and the absence of white gains unaccounted for by family 
changes. This pattern might be explained if black students received dispropor- 
tionate shares of the additional resources or if black students benefited more 
than white students due to similar increases in resources. 13 

The Tennessee Experiment 

Important new evidence for challenging the view that money doesn’t 
matter comes from a large-scale experiment in Tennessee on the effects of 
class size. The Tennessee experiment in education was largely ignored for sev- 
eral years by the wider research community, and only recently has been 
reanalyzed and given its deserved prominence (Ritter and Boruch 1999). This 
experimental research suggests that reductions in class size may, in fact, have 
more impact on disadvantaged and minority students than on white students. A 
quasi-experiment in Wisconsin that varied student/teacher ratio also provided 
new evidence (Molnar et al. 1999). 

The first experimental evidence on the effect of major educational vari- 
ables came from a Tennessee study on the effects of class size (Word, Johnston, 
and Bain 1990; Finn and Achilles 1990; Mosteller 1995). About 79 schools in 



13 A number of policies sought to shift resources toward minority or low-income students
during these years, including federal compensatory funding based on the percentage of
children in poverty, school desegregation, and court-directed or legislative changes in
state funding formulas toward minority and low-income school districts. However, other
factors operated over this time period that could have increased funding for middle- and
upper-income children as well. It is still unclear whether the net effect has been to
disproportionately shift resources toward minority and lower-income children.
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Tennessee randomly assigned about 6,000 kindergarten students to class sizes 
of approximately 15 or 23 students, and largely maintained their class size 
through third grade. Additional students entering each school at first, second, 
and third grade were also randomly assigned to these classes making the entire 
experimental sample approximately 12,000. After third grade, all students were 
returned to standard, large-size classes through eighth grade. The students in 
the experiment were disproportionately minority and disadvantaged — 33 per- 
cent were minority, and over 50 percent were eligible for free lunch. 

Analysis of the experimental data shows statistically significant, positive 
effects from smaller classes at the end of each grade from K-8 in every subject 
tested (Finn and Achilles 1999; Krueger 1999b; Nye, Hedges, and 
Konstantopoulos 1999; Nye, Hedges, and Konstantopoulos forthcoming). The 
magnitude of results varies depending on student characteristics and the num- 
ber of grades in small classes. Measurement of effect sizes from four years in 
small classes at third grade varies from 0.25 to 0.4 standard deviation (Krueger 
1999b; Nye, Hedges, and Konstantopoulos forthcoming). The current measurement
of long-term effects at eighth grade shows sustained effects of
approximately 0.4 standard deviation for those in small classes all four years,
but little sustained effect for those in smaller classes one or two years (Nye, 
Hedges, and Konstantopoulos 1999). Short-term effects are significantly larger 
for black students and somewhat larger for those receiving free lunches. 14 
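
The effect sizes quoted above are differences in mean scores expressed in standard deviation units. A minimal sketch of that calculation follows. The scores are invented for illustration and are not data from the experiment, the function name is hypothetical, and dividing by the control-group standard deviation is an assumption, since the cited analyses may standardize differently (for example, with a pooled standard deviation).

```python
from statistics import mean, stdev


def effect_size(treatment_scores, control_scores):
    """Mean difference divided by the control-group standard deviation."""
    return (mean(treatment_scores) - mean(control_scores)) / stdev(control_scores)


# Invented scores, for illustration only (not data from the experiment).
small_class = [52, 55, 58, 61, 64]
regular_class = [50, 53, 56, 59, 62]

d = effect_size(small_class, regular_class)
print(round(d, 2))  # about 0.42 standard deviation for these made-up scores
```

With real data the same ratio would be computed separately by grade, subject, and student group before comparing subgroups.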

Questions were raised about whether the inevitable departures from experimental
design that occur in implementing the experiment biased the results (Krueger 
1999b; Hanushek 1999). These problems included attrition from the samples, 
leakage of students between small and large classes, possible nonrandomness 
of teacher assignments, and schooling effects. Recent analysis has addressed 
these problems without finding any significant bias in the results (Krueger 
1999b; Nye, Hedges, and Konstantopoulos 1999; Nye, Hedges, and 
Konstantopoulos forthcoming; Grissmer 1999). It is possible for further analy- 
sis to find a flaw in the experiment that significantly affects the results, but 
extensive analysis to date has eliminated most of the potential problems. 



14 Long-term effects have not been reported by student characteristics. Following the 
experiment, Tennessee also cut class sizes to about 14 students per class in 17 school 
districts with the lowest family income. Comparisons with other districts and within 
districts before and after the change showed even larger gains of 0.35 to 0.5 standard 
deviations (Word, Johnston, and Bain 1994; Mosteller 1995). Thus the evidence here
suggests that class size effects may grow for the most disadvantaged students.
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The Wisconsin SAGE (Student Achievement Guarantee in Education) 
quasi-experimental study differed in several important ways from the Tennes- 
see STAR experiment (Molnar et al. 1999). In the SAGE study, only schools 
with very high proportions of free-lunch students were eligible for inclusion. 
Assignments were not randomized within schools, but rather a preselected con- 
trol group of students from different schools was matched as a group to the 
students in treatment schools. The treatment is more accurately characterized 
as pupil/teacher ratio reduction since a significant number of schools chose 
two teachers in a large class rather than one teacher in a small class. The size of 
the reduction in pupil/teacher ratio was slightly larger than the class size re- 
ductions in Tennessee. 

There were about 1,600 students in the small pupil/teacher treatment group 
in Wisconsin, compared to approximately 2,000 students in small classes in 
Tennessee. However, the size of control groups differed markedly — around 
1,300 students in Wisconsin and around 4,000 in Tennessee, if both regular 
and regular-with-aide classes are combined. The SAGE sample had approxi- 
mately 50 percent minority students with almost 70 percent eligible for free or 
reduced price lunch. 

The results from the Wisconsin study for two consecutive first grade 
classes show statistically significant effects on achievement in all subjects 
(Molnar et al. 1999). The effect sizes in the first grade are in the range of 0.1-0.3
standard deviations. The lower estimates between 0.1-0.2 occur in regression
estimates, while the raw effects and hierarchical linear modeling (HLM)
estimates are in the 0.2-0.3 range. While the estimates seem consistent with the
Tennessee study at first grade, more analysis is needed before the results can 
be compared. 

Learning From the Tennessee Experiment about 
Model Specification 

One of the problems with nonexperimental data analysis is that the re- 
search community usually fails to completely list the assumptions that are 
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required in any analysis to make the analysis equivalent to experimental data. 15 
Such listing of assumptions would make much more explicit the wide gap that 
exists between experimental and nonexperimental data analysis. 

Partly because there have been so few experiments in education, we have 
not paid much attention to their potentially critical role in shaping theories 
about education, helping to correctly specify variables and models using 
nonexperimental data, and specifying what data we should collect. If applied 
to reliable experimental data, models used to estimate nonexperimental data 
should be able to duplicate the experimental results. Krueger (1999b) suggests 
that production functions with previous year’s score do not duplicate the Ten- 
nessee effects except in the first year of smaller classes. This larger first-year 
effect has been interpreted as a socialization effect. 

The Tennessee results suggest several further specification issues. First, 
schooling variables in one grade can influence achievement at all later grades, so 
conditions in all previous years of schooling need to be present in specifications. 
Second, a pretest score cannot control for previous schooling characteristics. 
The Tennessee results suggest that two students can have similar pretest scores, 
similar schooling conditions during a grade, and emerge with different posttest 
grades influenced by different earlier schooling conditions. For instance, despite 
having similar schooling conditions in grades 4-8, relative changes in achieve- 
ment occurred in those grades for those having one to two or three to four years 
in small classes in K-3. Another way of stating this analytically is that effect 
sizes at a given grade can depend on interactions between this year’s schooling 
characteristics and all previous years’ characteristics. 

The production function framework using pretest controls assumes that 
any differences in pre- and posttests are captured by changed inputs during 
the period. The Tennessee results suggest that coefficients of such specifica- 
tions are un-interpretable from a policy perspective since the effect of a change 
in resources during a period cannot fully be known until past and future school- 



15 An excellent counterexample is Ferguson and Ladd (1996), which starts to describe the
conditions for a “gold standard” model and provides one of the most complete listings of
assumptions of any economic analysis. Raudenbush and Wilms (1995) and Raudenbush
(1994) also carefully outline the statistical assumptions in two kinds of models used in
education. See also Heckman, Layne-Farrar, and Todd (1996) for an analysis that tests
and provides evidence of the weakness of the assumptions inherent in a certain kind of
model linking educational outcomes to educational resources.
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ing conditions are specified. Thus the answer to the question of whether a 
smaller class size in second grade had an effect cannot be known until later 
grades, and the answer will depend on what the class sizes were in previous 
and higher grades. 

Another interpretation of the Tennessee data is possible — namely, that 
reduced class size is a multiyear effect whose precise pattern is dependent on 
duration. Being in a small class not only raises short-term achievement in the 
current year, but also has an effect in succeeding years. Then the effect in first 
grade consists of a residual effect from kindergarten plus an independent first 
grade effect. The second grade effect is the sum of the residuals from kinder- 
garten and first grade, plus an independent second grade effect. This explanation 
would account for the increasing effect with more years in small classes in the 
K-3 years — but would also account for the pattern after return to larger classes 
after third grade. Clearly there is a residual, and continuing, effect from having
attended smaller classes in grades K-3. However, the permanence of the effect 
depends on duration, indicating the effects are not simply additive. 
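
The multiyear interpretation above can be sketched as a simple recursion: each year in a small class adds an independent effect, and the effect carried over from earlier years is retained at a rate that depends on how many years were spent in small classes. This is only an illustrative sketch. The function names, the gain of 0.1 per year, and the retention rule are all hypothetical choices, picked so that the output loosely echoes the reported pattern; they are not estimates from the Tennessee data.

```python
def effect_by_year(small_class_years, total_years, gain_per_year, retention):
    """Return the cumulative effect at the end of each school year.

    The first `small_class_years` years are spent in small classes.
    Each of those years adds `gain_per_year`; every year the effect
    carried over from earlier years is multiplied by
    `retention(small_class_years)`, so persistence depends on duration
    rather than being simply additive.
    """
    keep = retention(small_class_years)
    effect = 0.0
    path = []
    for year in range(total_years):
        effect *= keep                    # residual from earlier years
        if year < small_class_years:
            effect += gain_per_year       # independent effect of this year
        path.append(effect)
    return path


def hypothetical_retention(small_class_years):
    """Assumed rule: three or more small-class years make the gain permanent."""
    return 1.0 if small_class_years >= 3 else 0.5


# Nine school years, kindergarten through eighth grade.
four_years = effect_by_year(4, 9, 0.1, hypothetical_retention)
one_year = effect_by_year(1, 9, 0.1, hypothetical_retention)
print([round(x, 3) for x in four_years])
print([round(x, 3) for x in one_year])
```

Under these assumed values the four-year path rises through third grade and then stays flat, while the one-year path fades toward zero, which is the qualitative pattern the text describes.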

Conceptually this makes the effect of class size reductions resemble a 
human “capital” input that can change output over all future periods, and mod- 
els specifying the effects of capital investments may be more appropriate. 16 
Production functions generally assume constant levels of capital, but children’s 
human “capital” is probably constantly changing and growing. 

From the standpoint of child development, these results are consistent 
with the concepts of risk and resiliency in children (Masten 1994; Rutter 1988). 
Children carry different levels of risk and resiliency into a given grade that 
appear to interact with the schooling conditions in that grade to produce gains 
or losses. For instance, four years of small classes appear to provide resiliency 
against later larger class sizes, whereas one year or two years do not. 

Few, if any, previous studies have included variables for prior years’ school 
characteristics from elementary school. At the individual level, virtually no 
longitudinal data from kindergarten were available. At more aggregate district 



16 Production functions are typically applied to model complete growth cycles in agriculture 
or other areas. We have tried to apply it to much smaller increments of growth in children 
by using pre- and post-test results. Production functions may have done less well in 
earlier studies predicting weekly plant growth as opposed to the complete cycle of growth
over a season.
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and state levels, data are usually available describing average characteristics 
for earlier years, but were probably seldom used. 

Since most data sets at the individual level, such as NELS, do not contain 
the previous year’s history for grades K-8, they cannot be used to estimate 
class size effects under this hypothesis. 17 Probably most previous measure- 
ments at the individual level have not had such data, and this might explain the 
downward bias in results. However, models using aggregate data have more of 
a chance at being able to include previous history — on average — for students 
in the sample. For instance, at a school district level, data would be available 
on class sizes in previous years. If no in-migration and out-migration occurs, 
then the average class size for district students can be determined for previous 
years. Migration will weaken the validity of these estimates, which means that 
higher levels of aggregation (state level data) will likely capture more accu- 
rately the historical class size for students in the aggregate sample. 
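
As a rough illustration of how aggregate data can supply the missing history, the sketch below averages a district's yearly ratio over the years a tested cohort spent in school. The district series, the function name, and the parameter names are hypothetical, and the calculation assumes no migration and no grade retention, so that every tested student was in the district and on grade in each earlier year.

```python
def cohort_average(ratio_by_year, test_year, grade_tested, first_grade=1):
    """Average ratio over the years a cohort spent in grades first_grade..grade_tested.

    `ratio_by_year` maps a school year to the district's average ratio
    in that year. Assumes no migration and no grade retention.
    """
    years = [test_year - (grade_tested - grade)
             for grade in range(first_grade, grade_tested + 1)]
    return sum(ratio_by_year[year] for year in years) / len(years)


# Hypothetical district series, for illustration only.
district_ratio = {1993: 20.0, 1994: 19.0, 1995: 18.0, 1996: 17.0}

history = cohort_average(district_ratio, test_year=1996, grade_tested=4)
current = district_ratio[1996]
print(history, current)  # 18.5 17.0
```

In this made-up series the current-year value (17.0) understates what the cohort actually experienced on average (18.5), which is the kind of mismeasurement the paragraph above is concerned with.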

The usual tendency for researchers is to trust the results of individual 
level analysis more than those of aggregate level analysis. This trust arises 
from several factors: larger sample size, more variance in variables, and some- 
times more detailed family data. However, individual level analysis is to be 
preferred over aggregate level only if the quality of variables is equivalent. If 
aggregate level data can better capture accurate historical information, then 
these estimates may produce better results. Another implication is that our 
data collection efforts should focus on longitudinal data from early years. Use 
of longitudinal data beginning at or prior to school entry can sort out some of 
the specification problems that may exist in previous analyses. 

There are two new sources of such longitudinal data that will include 
school, teacher, and family characteristics and achievement data. First, there 
are the newly emerging longitudinal state databases that link student achieve- 
ment across years. Such data have very large sample sizes, and linkages are 
possible with teacher data and school characteristics. These data will be better 
able to address some of the potential specification issues involving dependence 



17 The current year’s class size will work if it is highly correlated with all past years’ class 
sizes. However, at the individual level it seems likely that the random elements that 
determine year-to-year class size — including in-migration and out-migration and 
decisions when to create additional classes — would not make this year’s class size a 
particularly good predictor of previous years’ sizes, particularly over many grades.
However, this correlation should be explored.






David W. Grissmer and Ann Flanagan 



of later achievement on the previous year’s class size as well as thresholds and 
interactions with teacher characteristics. It may also be possible to determine 
class size effects in later grades as well as in early grades. The second source 
will be the Early Childhood Longitudinal Study (ECLS) funded by the U.S. 
Department of Education, which will collect very detailed data on children, 
their families, and their schools. These data will be much richer in variables, 
but much smaller in sample size than the state data sets. 

A Weak Test of the Hypothesis Using State 
NAEP Data 

Analysis of state NAEP scores is providing preliminary supportive evi- 
dence that certain state policies do matter in improving scores, that minority 
and disadvantaged students show the most gain from increased resources, and 
that the distribution of key resources is inequitable (Grissmer et al. forthcom- 
ing; see also Raudenbush in this volume). 

We have used the state NAEP data for the seven reading and math tests 
given between 1990 and 1996 at the fourth or eighth grade level to test two 
hypotheses: 

❖ whether aggregate state results provide estimates of pupil/teacher ratio 
that are in reasonable agreement with the Tennessee class size effects; 
and 

❖ whether these results change when we utilize a pupil/teacher ratio 
variable incorporating only the current year of the NAEP test vs. the 
average of all previous years in school. 

Estimates have been made using the 271 average state scores in equa- 
tions controlling for the effects of different family and demographic 
characteristics of students across states (Grissmer et al. forthcoming). We have 
utilized three different ways of controlling for family characteristics at the state 
level. We have supplemented the NAEP family characteristics with Census 
data to derive more accurate family variables than those provided by NAEP 
(Grissmer et al. forthcoming). We have also utilized SES-like variables de- 
rived from the NELS and Census data. We found little difference in results 
across these family measures. 

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We used a random-effects model and estimated with the generalized lin- 
ear estimator with exchangeable correlation structure, which takes account of 
the lack of independence of state observations across tests (produces robust 
standard errors), the unbalanced panels, and heteroskedasticity. We also have
made estimates with generalized least squares and maximum likelihood, achiev- 
ing almost identical results. 

In the equations linking average state scores to family and state educa- 
tional characteristics, we included four educational variables that account for 
95 percent of the variance in per pupil spending across states. These variables 
are average teacher salary, pupil/teacher ratio, teacher-reported adequacy of 
resources, and percentage of students in a state in public prekindergarten. 18 We 
found the expected signs and statistical significance for pupil/teacher ratio, 
teacher-reported resources, and prekindergarten participation. We found insig- 
nificant results for teacher salary. 

The pupil/teacher ratio effect in this model would predict a rise of about 
0.14 standard deviation for reduction of eight pupils per class (approximately 
the size of the Tennessee class size reductions). This effect is markedly smaller 
than the reported Tennessee class size effect of around 0.20-0.25. However, if 
we include in our models an interaction term allowing larger pupil/teacher 
effects for states with more disadvantaged students, we find markedly larger 
effects for states having more disadvantaged students. The Tennessee experi- 
mental sample contained a disproportionate percentage of minority and free 
lunch students, compared to all Tennessee students (Krueger 1999b). If we 
take into account the characteristics of the Tennessee sample and the interac- 
tion effect, the equations would predict a class size effect for the Tennessee 
sample that agrees with the actual effect. 
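
The role of the interaction term can be sketched as follows: the per-pupil effect of the ratio is allowed to vary linearly with the share of disadvantaged students, so the same reduction predicts a larger gain where that share is higher. The two coefficients below are placeholders chosen for illustration, not estimates from the analysis, and the function and parameter names are hypothetical.

```python
def predicted_gain(reduction, share_disadvantaged, b_ratio, b_interaction):
    """Predicted score gain (in standard deviations) from lowering the ratio.

    The marginal effect of one more pupil per teacher is assumed to be
    b_ratio + b_interaction * share_disadvantaged, so a reduction of
    `reduction` pupils changes the score by minus that amount times
    the reduction.
    """
    marginal_effect = b_ratio + b_interaction * share_disadvantaged
    return -marginal_effect * reduction


# Placeholder coefficients, for illustration only (not reported estimates).
B_RATIO = -0.010
B_INTERACTION = -0.020

low_share = predicted_gain(8, 0.2, B_RATIO, B_INTERACTION)
high_share = predicted_gain(8, 0.6, B_RATIO, B_INTERACTION)
print(round(low_share, 3), round(high_share, 3))
```

Evaluating such an equation at the composition of the experimental sample, rather than at the state average, is the adjustment the paragraph above describes.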

We have tested whether results for pupil/teacher ratio differed in our data 
set when the variables were defined using pupil/teacher averages during time 
in school vs. pupil/teacher value in the year of the test only. We use the state 
average pupil/teacher ratio during all years in school, the average during grades 
1 through 4, and the value in the year of the test. The estimates for these vari- 



18 We used a pupil/teacher variable rather than class size, since data were only available by 
year by state for the pupil/teacher ratio. While the two are highly correlated, one cannot
necessarily assume that reductions in pupil/teacher ratio and class size would produce the
same effects.
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Table 1. Comparing Three Pupil-Teacher Coefficients That Incorporate 
Differing Information about Previous Grades 



                                     Random effect          Fixed effect
Variable                             coef      t-value      coef      t-value

Average PR during school years      -0.015     -2.41       -0.014     -1.16
Average PR in grades 1-4            -0.020     -2.69       -0.026     -2.60
PR in year of the test              -0.008     -1.32        0.014      1.57



ables are shown in table 1 for random and fixed effect models. The results 
show that including current year pupil/teacher ratio instead of information from 
previous years causes the coefficients generally to weaken in both random and 
fixed effect models and to change signs in one model. 
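
To put the Table 1 coefficients on the same footing as the class size effects discussed earlier, the sketch below scales each one by a reduction of eight pupils per teacher. It assumes the coefficients are expressed in standard deviations per one-pupil change in the ratio, which the table itself does not state; the constant and function names are hypothetical.

```python
# Coefficients copied from Table 1 as (random effect, fixed effect).
TABLE_1 = {
    "Average PR during school years": (-0.015, -0.014),
    "Average PR in grades 1-4": (-0.020, -0.026),
    "PR in year of the test": (-0.008, 0.014),
}
REDUCTION = 8  # pupils per teacher, roughly the size of the Tennessee reduction


def implied_gain(coef, reduction=REDUCTION):
    """Score change implied by lowering the ratio by `reduction` pupils."""
    return -coef * reduction


gains = {
    name: tuple(round(implied_gain(coef), 3) for coef in coefs)
    for name, coefs in TABLE_1.items()
}
for name, (random_gain, fixed_gain) in gains.items():
    print(f"{name}: random {random_gain:+.3f}, fixed {fixed_gain:+.3f}")
```

Under that assumption the two history-based variables imply random-effect gains of 0.12 and 0.16, which bracket the 0.14 figure quoted earlier, while the current-year variable implies 0.064 in the random-effect model and a loss in the fixed-effect model.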

The Investments That Do Matter 

The long debate about the role of resources in education has finally shifted 
from whether money does matter to what kinds of investments do matter for 
what kinds of children. The earlier conclusions drawn from reviews of the 
nonexperimental literature (Hanushek 1994) — that money has not mattered 
due to the inefficiency of our public school system and its lack of incentives — 
appear flawed. Over the last 25 years, money invested in schools for regular 
education students has gone mainly to develop programs targeted at minority 
and disadvantaged youth, lower pupil/teacher ratios, and raise average teacher 
salaries. Evidence is emerging that at least two of these investments have paid 
off for minority and disadvantaged students — lowering pupil/teacher ratios and 
targeting resources to minority and disadvantaged children. However, at least 
part of the money used to reduce pupil/teacher ratio for students from families 
with higher SES levels — the majority of students — may have been spent inef- 
ficiently. 

Still, the broad-ranging conclusions that money does not matter in edu- 
cation without substantial changes in the existing structure of and incentives in 
public education are contradicted by experimental evidence and the results 
presented here. Moreover, the evidence supporting these conclusions now ap- 
pears to be based on poor model specifications. This leaves the more viable 
hypothesis — that money does matter if invested in the right programs and tar- 
geted toward minority and disadvantaged students (Grissmer, Flanagan, and 

Williamson 1998b; Grissmer 1999). 

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Implications for Future Methodology and Data Collection 

We suggest several specific ways that data and methodology might be 
improved, as follows: building micro-level models of educational processes, 
conducting more experiments, and improving NAEP data. For the latter, we 
discuss such measures as using school district samples rather than school 
samples, collecting additional family variables, improving children’s responses 
(especially with regard to reporting levels of parental education), collecting 
additional information from teachers, using state Census data to improve the 
individual level variables, using supplementary data from the Census, and col- 
lecting additional parent information. 

Building Micro-level Models of Educational Processes

The results of either experimental or nonexperimental analysis are meant 
to provide the material for developing theories of educational processes and 
student learning that gradually incorporate wider phenomena in their purview. 
Eventually, these theories should accurately predict the results of empirical 
work and be able to make new predictions to guide future empirical work. 
Theories by their very nature are more robust than any set of experimental or 
nonexperimental studies since they incorporate results of multiple measure- 
ments and incorporate research across levels of aggregation. However, little 
theory building has been done in education. 

Hierarchies exist in science whereby certain areas of science are derived 
from and built upon the knowledge in more basic science. For example, the 
science of chemistry relies partly upon basic knowledge in physics for expla- 
nation. The science of biology is partly built from knowledge of chemistry; 
and, within biology, molecular biology provides some basis for the applied 
science of medicine. Typically the ordering of these hierarchies is derived from 
the size of the basic building blocks studied. Physics studies elementary par- 
ticles and atoms. Chemistry studies combinations of atoms. Biology studies 
complex combinations of atoms with certain structures (genes, etc). 

Education is far up in the hierarchies of social science. It rests upon
knowledge derived from psychology, cognitive and brain science, genetics,
sociology, child development, psychopathology, and economics. It is one of
the more complex “sciences” that depends on good basic science in the lower 
hierarchies. Without linking the knowledge from these more basic sciences, 
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educational research will never have a solid foundation. Educational research 
needs to incorporate the findings of these more basic sciences in building its 
theories, data collections, and methodologies. An example is the need to un- 
derstand why smaller class sizes seem to produce higher levels of student 
achievement and why the results are multiyear and can be either short- or long- 
term. 



Research directed toward measuring class size effects has generally 
treated the classroom as a black box in which only inputs and outputs are needed 
and in which knowledge of the transforming processes inside are unimportant 
for purposes of measurement. The current analytical methods also isolate the 
cause and effects of class size reductions within precise time periods in a way 
that seems at odds with the more continuous, cumulative, and often delayed 
effects that occur in children’s cognitive development. Reconciling the differ- 
ences in experimental and nonexperimental evidence will probably require a 
far better understanding of the underlying mechanisms occurring in classrooms 
and the developmental process in students that determine achievement. 

In the case of class size, we need a theory of classroom and home behav- 
ior of teachers, students, and parents that answers why smaller classes might 
produce higher achievement in both the short and the long term. Initially we 
need to understand what teachers and students do differently in large and small 
classes and then whether these differences can be related to the size of short- 
term achievement. Perhaps the more difficult area of theory will be to explain 
gains long after the end of an intervention. An early intervention either has to 
change cognitive, psychological, or social development in important ways or 
change the future environment (e.g., peers, families) that affects the individual. 
Possibilities range from changes in brain development to learning different 
ways of interaction with teachers and peers to developing different study hab- 
its to being in different peer groups years later. 

Answering these types of questions not only requires different types of 
data collection, but also requires understanding much about psychology, child 
development, and individual behavior (teacher and student). We provide some 
simple examples in the appendix at the end of this paper of the types of model- 
ing and data collection that spring from alternate hypotheses about why smaller 
class sizes work. 

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One type of theory-building would use time on task as a central organiz- 
ing concept in learning. A secondary concept involves the productivity and 
optimal division of that time among the different alternatives: new material 
through lectures, supervised and unsupervised practice, periodic repetition, and 
review and testing. 19 Students have a wide variance in the ways they spend 
time in school and at home, and it is likely that home time can substitute for 
specific types of teacher time. 

Some research suggests that significant differences may exist in the 
amount of instructional time and the ways in which it gets used across differ- 
ent types of classes and different teachers and by students with different 
characteristics (Molnar et al. 1999; Betts and Shkolnik 1999a; Rice 1999). A 
theory of learning needs to be developed that incorporates school and home 
time and the various tradeoffs and differences that exist across teachers, class- 
rooms, and SES levels. Such a theory would generate a number of testable 
hypotheses for research, which would then allow better and probably more 
complex theories to be developed. Such theories would then provide guidance 
as to what research is important to undertake. 

Such theory-building would mandate linking several disparate and iso- 
lated fields of research in education. There is micro-research involving time on 
task, repetition, and review in learning specific tasks. There is research on 
teachers in classrooms. There is research on homework and tutoring. There is 
research on specific reading and math instructional techniques. There is re- 
search on class size and teacher characteristics. Theorists can begin to understand 
these disparate areas and suggest theories that can explain the empirical work 
across these areas. Such linkages seem essential to future progress. 

Finally, cognitive development may have patterns of development simi- 
lar to other areas of development in children, since brain development seems 
to be central to each type of development. There is much research on patterns 
of physical, emotional, and social development in children from birth, differ- 
ences across children, delays in development, and dependence on previous 
mastery. Studies involving long-term developmental outcomes — especially 
for children at risk — identify resiliency factors that enable development to oc- 
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cur even in highly risky situations. Much can be learned from this literature 
that can help prevent researchers from making poor modeling assumptions. 

Need for Experiments 

A major question raised by many other researchers, and currently under 
discussion, is the role of experimentation in educational research and other 
areas of social science (Burtless 1993; Boruch and Foley 1998; Boruch 1997; 
Hanushek 1994; Heckman and Smith 1995; Ladd 1996a; Jencks and Phillips 
1998). Many interesting and complex issues arise in thinking about future ex- 
perimentation, but consensus is emerging on the need for more experimentation 
in education. 

Certainly, the value of the Tennessee experiment suggests that a selected 
number of social experiments may considerably add to our consensus knowl- 
edge in education. Besides the accuracy of the direct results, experiments tell 
us how to get more reliable results from nonexperimental data. Although ex- 
pensive to carry out, experiments may be cheap compared to the costs of 
ineffective educational policies. 

However, experimentation is much easier in smaller settings than in the 
classic, large-scale social experiments such as that conducted in Tennessee. A
very simple set of experiments could be designed around classroom- and school- 
level variables that would be much easier to carry out, yet could provide a 
better underlying base of information on which to build educational theories. 
For instance, simple experiments that divide children who miss a particular 
test question into two remediation groups with retesting could help locate the 
cause of missed questions and help develop efficient methods of remediation. 
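
A small classroom experiment of this kind hinges on assigning students to the two remediation groups at random. The sketch below shows one way to do that; the student identifiers, group labels, and function name are hypothetical, and the fixed seed is only there to make the assignment reproducible.

```python
import random


def assign_remediation(student_ids, seed=0):
    """Randomly split students into two remediation groups of near-equal size."""
    rng = random.Random(seed)      # private generator, reproducible by seed
    shuffled = list(student_ids)   # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"method_a": shuffled[:half], "method_b": shuffled[half:]}


# Hypothetical students who missed a particular test question.
missed_question = ["s01", "s02", "s03", "s04", "s05", "s06", "s07"]
groups = assign_remediation(missed_question)
print(groups)
```

Retest results for the two groups could then be compared directly, since random assignment makes the groups comparable on average.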

Improving the NAEP Data 

The NAEP data are becoming so central to issues in both educational and 
social policy that priority should be given to significant expansion and im- 
provement. We address two issues with respect to the NAEP data: (1) redesign 
of the sample to be district- rather than school-based and (2) improving family 
variables. 



A School District NAEP Sample 



The hypothesis suggested here implies that the lack of historical data on
schooling variables may prove to be a barrier to unbiased results with individual
level NAEP data. Here we focus on one option that would improve the
aggregate data analysis possible with NAEP data. If NAEP could become a 
school district sample rather than a school sample, then historical data from 
school districts (not available at the school level of aggregation) could be used 
in the formulation of variables. 

A district level sample would also result in improved family variables in 
NAEP data, since Census data would be available for most school districts. 
Currently, family variables in NAEP cannot be improved with Census data at 
the school level because privacy concerns prohibit their use within school ar- 
eas. A school district sample would also address another NAEP 
deficiency — namely, the absence of several educational policy variables not 
available at the school level, such as per pupil spending. A much wider and 
better defined set of educational policy variables is readily available at the 
school district level and is already collected. Thus, a school district, rather than 
school level, NAEP sample would be desirable from the standpoint of improv- 
ing family controls and educational policy variables. 

A straightforward random sample of students at the district level would 
involve additional administrative costs, because the districtwide student uni- 
verse would be needed and administration of tests would have to occur across 
many schools or involve assembling students from many schools in a central 
location. Such a sample would also have the disadvantage that, while Census 
and educational policy data would be available at the district level, certain 
school level characteristics obtained from student data at the school level would 
be missing. For instance, the school level sample of students is often used to 
define the characteristics of peers and their families. So a trade-off would oc- 
cur with a district sample in that the educational and family characteristics 
would improve, but less would be known about some of the local, school level 
characteristics. Much of this missing school level data could probably be col- 
lected using enrollment data available at the school level. For instance, instead 
of using the sample of 20 students per school to estimate percentage minority, 
this figure would be obtained from schoolwide enrollment data. 

Another change that would occur with a district sample would be that 
the sample of teachers surveyed would increase substantially. Currently, a typical 
classroom sample is 10-25 students, and a single teacher survey is collected. 
In a district sample, there would be few students selected from the same 
classroom, so the teacher sample would approach more closely the size of the student 
sample. The larger teacher sample would have some advantages besides 
increased size. The desired teacher variables are the characteristics of all teachers 
of the students from the time they entered school. The current teacher sample 
of one to two teachers per school, since it is a very small sample, is a very 
weak proxy for the characteristics of teachers at the school or the characteris- 
tics of all previous teachers of the students. Obtaining a much larger sample of 
teachers at the district level would provide a better proxy for the kinds of teachers 
likely to have taught in the district. 

It may be possible to combine school level and district sampling to obtain 
a reasonable sample for each. About one-half of public school districts have 
fewer than 1,000 students and only one or two elementary schools per district. 
Thus, this sample of school districts would be close to the size of a school sample. 
However, these districts constitute only about 6 percent of total students. At the 
other end of the spectrum, there are about 300 districts with over 20,000 stu- 
dents, which account for nearly one-third of all students. In these districts, the 
number of schools ranges from about 30 to over 600. In most of these districts, a 
district sample could be drawn based on samples of schools, with 5-10 students 
per school. The remaining 60 percent of students are in school districts where 
some limited clustering by school could occur, but a sound district sample would 
probably have to include students from most schools. 

However, it may be feasible to design a joint district- and school-based 
sample that samples fewer students per school. Such a sample would have 
several analytical advantages. It would contain an additional hierarchy in the 
sample — the district level, where extensive and better data exist on families 
and schools. It could still contain school-based samples, but with fewer stu- 
dents per school. It would also enlarge the number of teachers surveyed. Such 
a sample design would, however, entail additional costs since more schools 
would be sampled, district samples would require more effort at developing 
universe files, and more teachers would be surveyed. 

The question is whether the analytical advantage would be worth the ad- 
ditional cost. To answer this question, we suggest a two-stage feasibility analysis 
in which a preliminary assessment by a group of statisticians and researchers 
would be performed to see whether serious barriers exist, to develop prelimi- 
nary cost estimates, and to better define the analytical advantage. This group 
would either recommend a more detailed study and assessment or make the 
judgment that the analytical advantage is probably not worth the cost. 

One of the chief advantages of moving to a district sample is that 
comparisons of scores could be made for major urban and suburban area school 
districts. It is the urban school systems that pose the largest challenge to im- 
proving student achievement, and being able to develop models of NAEP scores 
across the major urban school districts could provide critical information in 
evaluating effective policies across urban districts. The sample sizes would be 
much larger than at the state level and could be expected to provide more reli- 
able results than for states. 



Improving Family-level Variables 

The primary objective of NAEP has always been seen as monitoring trends 
in achievement rather than explaining those trends. One result of this philosophy 
is that few family variables have been collected with NAEP. Compared with 
family data collected with other national achievement data or on other govern- 
ment surveys dealing with children’s issues, NAEP collects very few family 
variables. In addition, the quality of the family variables collected has always 
been questioned since they are reported by the students tested. The perception of 
weak family variables may partially explain why NAEP scores have not been 
utilized more frequently in research on educational and social policies. 



We have compared the accuracy of NAEP family data with Census data at 
the state level and analyzed how sensitive our estimates from state NAEP data 
are to the use of NAEP variables, Census variables, and SES variables formulated from 
parent-reported NELS data (Grissmer et al. 1998). Not surprisingly, we find that NAEP 
variables for race and family type (single-parent or two-parent) match Census data 
well, once differences in the samples are accounted for. However, students 
substantially inflate their parents’ education level at the college level. Fourth graders 
report that 58 percent of their families include a college graduate, compared to 26 percent 
reported in the Census; comparable figures for eighth graders are 42 percent compared 
to 25 percent in the Census. However, reports of “high school only” and “not a 
high school graduate” are much more accurate. Students appear to be unable to 
distinguish between “some college” and “college graduate” — and individuals us- 
ing NAEP data should combine these two categories when using the data. 
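
The recommended recoding can be sketched in Python as follows. This is only an illustration: the category labels and the name of the merged category ("any college") are hypothetical stand-ins, not actual NAEP codes.

```python
# Hypothetical category labels; the actual NAEP codes may differ.
COLLAPSE = {
    "not a high school graduate": "not a high school graduate",
    "high school only": "high school only",
    "some college": "any college",       # merged category (assumed name)
    "college graduate": "any college",   # merged category (assumed name)
}


def collapse_parent_education(reported):
    """Map a student-reported category to the collapsed three-level scale."""
    return COLLAPSE[reported]


def category_shares(reports):
    """Return the share of reports falling in each collapsed category."""
    counts = {}
    for r in reports:
        key = collapse_parent_education(r)
        counts[key] = counts.get(key, 0) + 1
    total = len(reports)
    return {k: v / total for k, v in counts.items()}
```

Collapsing the two post-high school categories gives up some detail, but it avoids relying on a distinction that students appear unable to report reliably.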

There are several ways that the family variables can be improved in the 
NAEP data collection. We describe six increasingly complex options. 



Collecting additional variables from children. There are two variables that are strongly 
significant in equations linking family characteristics and achievement that 
can be easily included and that children probably could report with some 
accuracy. The first variable is family size (number of siblings), which should present 
little problem for student reporting. The second is current age of mother. The 
age of mother combined with the child’s age would enable the variable of age 
of mother at birth to be computed. Some pretesting may be required to deter- 
mine the method of asking these questions, but even reporting mother’s age in 
gross categories — five-year groupings — would be an improvement. 
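
As a rough sketch of how a banded response could still yield the derived variable, the Python function below uses the midpoint of the reported band in place of the mother's exact age. The function name, the band bounds, and the use of the midpoint are illustrative assumptions, not part of any NAEP procedure.

```python
def mother_age_at_birth(band_low, band_high, child_age):
    """Approximate the mother's age at the child's birth.

    band_low and band_high are the inclusive bounds of the reported
    five-year band for the mother's current age (for example, 30 and 34).
    The band midpoint stands in for the unknown exact age, so the
    result is only a rough approximation.
    """
    if band_low > band_high:
        raise ValueError("band_low must not exceed band_high")
    midpoint = (band_low + band_high) / 2
    return midpoint - child_age


# Example: a nine-year-old reports a mother aged 30 to 34.
approx_age = mother_age_at_birth(30, 34, 9)  # 32 - 9 = 23.0
```

Pretesting could indicate whether students report exact ages or only bands more reliably, and therefore how much precision such a derived variable can carry.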

Recent research is finding that two-parent families with a stepparent do 
not have the same effects as families with two biological parents (McLanahan and Sandefur 
1994). The effects on children from a family including a stepparent appear to 
be closer to single-parent effects than to living with two biological parents. So 
information that could distinguish two-parent biological families from those 
with a stepparent would be useful. Adding a question on whether the parents 
are divorced is one approach. Asking separate questions about living with each 
parent is another approach. 

One other variable that should be considered is locus of control. Locus of 
control is derived from a set of questions focusing on the perceived ability to 
affect life events. There are now more specific sets of questions that focus on 
specific events or conditions such as school performance. Locus of control has 
been collected in the NELS and NLSY data sets and is strongly statistically 
significant in equations relating achievement to family characteristics after all 
the common family characteristics are entered. 

Improving children's responses. It appears that students have the least knowledge 
about post-high school education levels of parents. One hypothesis is that chil- 
dren have simply never asked parents about education level. Another is that 
parents report inaccurate levels of education to children, somewhat inflating 
their own level of education. In the former case, it may be possible to have 
children formally or informally ask parents prior to the test. This could take 
the form of a simple request before the test or a more formal written form for 
the parents to fill out. Pretesting this approach could help determine which 
hypothesis is causing the inaccuracy in reporting. 

Collecting supplemental data from teachers. While individual level parental char- 
acteristics are desirable, teachers of NAEP students currently fill out an extensive 
survey that could be used to obtain family information. Teachers currently do 
not provide information concerning the socioeconomic characteristics of their 
students. Teachers could be asked several questions concerning the character- 
istics of the groups of students in their classes that might improve the data on 
family characteristics. These questions would take the form of identifying per- 
centages of the students that fall into various categories. Income levels would 
probably be the most useful information. Giving teachers broad categories of 
income could prove better than the category of free and reduced price lunch as 
a control for family income. Items could include estimates for nearly all the 
important family variables. Such information could be first collected on a trial 
basis at low additional cost, perhaps for one year and utilized to see whether it 
improves the models. 

Using state census data to improve individual level variables. We have utilized Cen- 
sus data to improve NAEP family variables at the state level. If NAEP data 
were only to be analyzed at the state level, the Census data combined with 
NAEP data could probably provide good estimates of all family background 
variables. However, the real value of NAEP data lies in the individual level 
data, and direct Census data have not been available at that level. So similar 
techniques cannot be used to directly derive school or individual level Census 
estimates. 

It is possible to improve some of the reported NAEP variables at the 
school and individual levels by using the knowledge gained from state level 
comparisons. State level comparisons provide information about the accuracy 
of items such as parental education, and this information can in a limited way 
be used to impute better estimates to individual level variables. One simple 
application of this is to combine “some college” and “college graduate” into a single 
category. 

Further regressions across states linking the NAEP and Census estimates 
can provide information about how differences are connected to other family 
characteristics. For instance, the errors in reporting family education may be 
greater in states with high minority populations and lower incomes. This kind 
of information may be useful to impute better values at the individual level 
data. Such work would seek to better identify the types of students who report 
accurate and inaccurate data. However, while this approach should be tried, it 
would probably not result in dramatic improvements in the quality of indi- 
vidual level data. 

Using supplementary data collection. The key information about family 
characteristics at the school level that would improve NAEP data might also be gathered 
directly from Census data. While privacy concerns limit the data available from 
Census at individual levels, the U.S. Department of Education would probably 
be able to obtain from Census data the school level population characteristics, 
if school boundaries were available. This would have to be done in conjunc- 
tion with the NAEP data collection by collecting school boundary data on maps. 
Many school districts may be sufficiently large to allow Census to provide 
school data aggregated from the block level. This option should certainly be 
explored with the Census Bureau, and its cost assessed. There are many com- 
mercial vendors who can provide such data if given maps for specific 
disaggregated areas. The relative cost of this option compared to the cost of 
NAEP would be low. 



The Census data could provide almost all the important background char- 
acteristics at the school level. But it would only be for all families in the 
area — not just the characteristics of families with fourth graders, for example. 
But the data would be highly correlated. Such data also could not track well 
the changes over time. Finally, the data would also be biased to the extent that 
the student population is not defined by specific geographical boundaries. But 
the advantages of this method would be the relatively low cost and the ability 
to provide a much richer set of characteristics at the school level. 



Limiting parental data collection. Parental data collection for NAEP has always 
been a politically controversial issue, so extensive data collection similar to 
the type of collection performed on other U.S. Department of Education sur- 
veys is probably not feasible. The NELS, for instance, collects data from parents 
in an extensive survey. We consider here the minimum level of information 
which parents could provide that would enhance the NAEP data. The primary 
reason for parental data collection is to strengthen the individual level data in 
NAEP. A simple one-page form with no more than five items could solve the 
major problem with NAEP family data. It would take no more than a minute or 
two to fill out. It would ask for the key family background variables necessary 
for achievement score equations that are not accurately provided by the stu- 
dent. They include education level of each parent, family income in categories, 
and age of each parent. While a more extensive survey could certainly provide 
useful information, this minimum level of information would allow consider- 
ably more confidence in the use of individual level NAEP data without placing 
undue burden on parents or children. 

Summary 

The interdisciplinary nature and the inherent complexity of educational 
research contribute their own set of challenges, but an additional reason for the 
lack of success in building consensus in educational research is the low invest- 
ment in educational R&D and more broadly on R&D on children. On average, 
the nation spends approximately 2-3 percent of its gross domestic product for 
R&D. However, this proportion is not uniform across sectors of the economy, 
but can vary from less than 1 percent to approximately 20 percent (pharmaceu- 
ticals and integrated circuits) (Grissmer 1996). Currently, we spend less than 
0.3 percent of educational expenditures for R&D, and less than 0.3 percent of 
expenditures for children are directed toward R&D on children (Consortium 
on Productivity in the Schools 1995, Office of Science and Technology Policy 
1997). Compared to other sectors, this is a very low investment in R&D. Per- 
haps the reported problematical quality of educational R&D is partly due to 
the insufficiency of funding, when compared to its inherent complexity 
(Grissmer 1996; Wilson and Davis 1994; Atkinson and Jackson 1992; Saranson 
1990). Alternately, the low funding level might reflect the poor quality of R&D. 

Successful R&D is the engine that drives productivity improvement in 
every sector of our economy. Thus, strong R&D in education is a prerequisite 
to continual improvement in our education system and in our children’s well- 
being. Without solid R&D, we will continue to go through wave after wave of 
reform without clearly separating the successful from the unsuccessful. 20 It is 
difficult to see how American K-12 education can become world class unless 
our educational R&D begins to build a more solid foundation of knowledge 
concerning education. If R&D can begin to play the role that it does in virtu- 
ally every other sector of our economy, then continual educational improvement 
can be taken for granted, just as continual improvement in automobiles, com- 
puters, and life expectancy is now taken for granted. 



20 It is not that some reforms may not have been effective or had an impact on educational 
outcomes. The history of student achievement and educational outcomes suggests that 
scores have risen over long periods of time — and that students of a given era always seem 
to outscore their peers of earlier eras (Neisser 1998; Rothstein 1998). Rather, R&D could 
considerably improve the efficiency of the process of sorting the various reform initia- 
tives and ensuring that the best are saved and the worst discarded. 

Appendix 

Simple Process Models of Class Size Effects 

We start here by developing some simple models of the mechanism within 
classrooms that might cause class size effects and follow the implications of 
these assumptions on how we should specify models and why class size effects 
might be expected to have fairly wide variance. We do this simply to show that 
an important link is missing, a link that can guide us in specifying models and 
interpreting results of previous studies. If class size effects are produced by the 
kind of mechanisms assumed here, it implies that actual class size effects should 
have a wide variance and that some of the model specifications that were thought 
to be best actually can provide highly biased results. 

Reductions in class size must change processes that occur in the class- 
room in order to have impacts on achievement. These differences in process 
that occur within smaller classes appear to determine whether class size affects 
achievement at all, whether effects are large or small, and whether effects widen 
or stay constant over several grades (Murnane and Levy 1996). In addition, the 
design of assessment instruments can determine whether class size effects are 
present in measurements. 

Unless we know what processes change and how achievement is assessed, 
we cannot determine what model specifications and estimation techniques are 
appropriate. Since the data to determine what processes change in smaller class 
sizes are generally not collected, it will be difficult to sort out the reasons for 
the wide variance in the previous literature. We will discuss some simple, but 
extreme models to illustrate the point. 

Demand for Teacher Individual Time 

If we assume a “college professor” lecture model of classroom proce- 
dure, where there is essentially little or no interaction between teacher and 
student either during or after class and administrative time is borne by teach- 
ing assistants, then class size makes no difference. In this case, there is no cost 
to the teacher in having more students in the class. Class size makes a differ- 
ence only when we assume that some teacher time is taken up by individual 
students — either through questions, special academic assistance, disciplinary 
actions, or administrative time (grading homework). In this case, additional 
students add to the teaching workload. If teachers have a fixed amount of time, 
then adding students can result in less time for presenting material or less time 
for student assistance. Thus, the size of the class size effect should depend on 
the portion of time teachers spend dealing with individual students (in one way 
or another) vs. time spent in general, lecture style instruction. In general, the 
more students need individual time, the larger will be the class size effect — 
other things being equal. 

A second consideration is the variance among different types of students 
in requiring individual attention. A reasonable assumption here is that higher 
ability students or those with higher levels of family resources (broadly de- 
fined) — on average — will require less individualized attention. Essentially, 
substitution is occurring between family resources and school resources. In 
families with more resources, more of the students’ academic and psychologi- 
cal needs are addressed at home, requiring less attention at school. This can 
include simple things such as helping with homework, enhancing learning op- 
portunities, tutoring, and addressing the child’s behavioral problems. For lower 
ability children or those with fewer family resources, more individualized at- 
tention will probably need to occur at school in order for them to achieve learning. 

Thus, one would expect that class size effects would be larger for classes 
with lower ability students or students with fewer family resources. This also 
implies that there will be maximum class size levels (thresholds) that allow all 
the productive individual attention required, and below which further reductions 
in class size will produce no additional effect. But this threshold will vary by level of family resources. 
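
The time-budget reasoning in this section can be made concrete with a small Python sketch. Everything in it is a simplifying assumption introduced for illustration: a fixed daily teacher time budget, a fixed block of whole-class instruction, and an identical individual-time need for every student in the class.

```python
def attention_shortfall(class_size, teacher_minutes, lecture_minutes, need_per_student):
    """Minutes of individual attention each student needs but does not get.

    All inputs are hypothetical: a fixed daily time budget, the part of
    it reserved for whole-class instruction, and the individual minutes
    each student requires.
    """
    available_per_student = (teacher_minutes - lecture_minutes) / class_size
    return max(0.0, need_per_student - available_per_student)


def threshold_class_size(teacher_minutes, lecture_minutes, need_per_student):
    """Largest class size at which every student's need can still be met."""
    return int((teacher_minutes - lecture_minutes) // need_per_student)


# Students needing 5 minutes each: the need is fully met up to 20 students.
low_need_threshold = threshold_class_size(300, 200, 5)    # 20
# Students needing 10 minutes each: the threshold falls to 10 students.
high_need_threshold = threshold_class_size(300, 200, 10)  # 10
```

In this sketch the shortfall is zero for any class at or under the threshold, so class size matters only for larger classes, and the threshold itself falls as students' need for individual time rises.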

Teacher and Curriculum Decisions 

A third consideration is the teacher’s reaction to scarcity of time. Teach- 
ers continually make choices about how fast to proceed with the scheduled 
curriculum, how much time to allow for slower students vs. faster students, 
and how much time to put into individual instruction vs. lecturing. With more 
students per class, these decisions become critical in determining whether class 
size effects occur. One scenario is that teachers slow down the pace of instruc- 
tion in response to time scarcity. Individualized instruction is maintained for 
slower students, but less material is covered for all students. So the net effect is 
to cover less material for the school year for the whole class. Here, one might 
expect to see class size effects for the higher ability students (less material 
covered), but less so for those of lower ability. 

Another teacher strategy is to cover all the material throughout the year 
(more time lecturing) and spend less time in individual instruction with slower 
children. Here, average class scores should shift downward in larger class size, 
but in a different pattern. Scores of higher scoring students would not be af- 
fected, but lower scoring students would have lower scores. 

A crucial consideration in measurement is how the curriculum is adjusted 
the next year in response to these teaching strategies. If the effect of larger 
classes is failure to cover all the material for all students in the class, the next 
year’s curriculum may or may not include all the material. If the curriculum 
accommodates this and starts where the previous year left off for each student, 
then over many years there will be an increasing gap between children in larger 
and smaller classes, i.e., the size of the effect will depend on the cumulative 
years in smaller class size. Thus, if smaller classes were instituted in grades 
K-8, one would expect to see a widening gap with each grade. 

On the other hand, the start of next year’s curriculum could begin uni- 
formly for all students regardless of the amount of material covered last year. It 
could be started at the point where the larger or smaller class sizes left off. If it 
starts where the larger class sizes left off, then the gain from extra material 
covered in the smaller class in the previous year is lost. Thus, no cumulative 
effect is present, but a uniform score difference will be present each year. Es- 
sentially the smaller classes will cover additional material each year, but the 
gain from the previous year will be lost. 

If the curriculum for all students is set where the smaller class sizes left 
off — leaving a permanent gap in coverage for those in larger class sizes — then 
whether the effect is cumulative depends on the extent to which mastery of the 
previous grade’s material is required to perform well in the current grade. For 
subjects like math and reading, earlier mastery is probably more essential, and 
a widening gap would occur over several grades, i.e., the annual gap in mate- 
rial coverage would cumulate, causing further deterioration in later scores. On 
the other hand, in subjects like history or geography, where earlier mastery 
may be less important, the previous year’s gap plays no role in next year’s 
score, and a constant class size effect would be expected by grade. 

Design of Assessment Instruments 

Another consideration affecting the size of the measured class size effect is 
how assessment tests are designed. Designers of norm-referenced assessment 
tests first sample students’ current knowledge at a given grade and develop a 
battery of questions that attempt to span the entire domain. A set of questions 
is chosen that provides a continuous range of question difficulty such that a 
different percentage of children answers each question correctly. Some 
questions nearly all students answer, while some are included that only a 
small percentage answer. 

However, the domain of knowledge can depend on the size of classes 
attended by students. It is possible in some circumstances to have extra mate- 
rial covered by smaller class sizes included in assessments, while in other 
circumstances the extra material will not be part of the test. For instance, if 
tests were developed five years ago based on the then-existent domain of knowl- 
edge, and class sizes have declined since that time resulting in more material 
covered, the assessment instrument may not pick up the class size effect. Simi- 
larly, if assessment instruments are designed with students in larger classes 
prior to experimentation with smaller class sizes, then it is possible for the 
effects of class size to be attenuated if the instruments do not reflect possible 
additional material covered by smaller classes. In general, instruments designed 
to measure students’ knowledge across several grades, rather than within each 
grade, and “re-normed” more frequently will be less vulnerable to these kinds 
of design effects. 



Some Implications for Measurement and Specification 



The above discussion illustrates that the size of the effect, its measure- 
ment, and its interpretation can depend on what occurs differently within the 
classroom when larger and smaller class sizes occur and on how assessments 
are designed. It implies that actual effects could vary considerably depending 
on different levels of student demand for individual time, teacher strategy, the 
coordination of the curriculum (e.g., year to year by class size), the different 
dependence by subject on previous knowledge, and assessment design. It would 
not be surprising in our decentralized educational system that smaller class 
sizes generate a wide variety of teacher and curriculum responses. Thus, ambi- 
guity of results may not be surprising. Moreover, we may never be able to sort 
previous studies into groups with similar classroom process controls because 
the data along these kinds of dimensions were never collected for previous 
studies. So much of the work with previous data collections lacking these vari- 
ables may have to be discounted. 

The specifications of previous models have rarely taken explicit account 
of expected effect differences by family resource levels or tested for whether 
effects were constant or widened by grade. The latter consideration is critical 
to determining how models should be specified. For instance, if the conditions 
are present for a constant rather than an accelerating gap by grade, value-added 
models that control for previous years’ test scores can show null effects of 
class size even though effects are present each year (Krueger 1999b). Effects 
would show up only in the first year in which class sizes were changed, but not 
in subsequent years. Such models would pick up only grade-by-grade accel- 
eration in score changes. Here, simple cross-sectional models by grade without 
control for previous scores would show the total constant and cumulative ef- 
fect to each grade. 
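
The contrast between the two specifications can be illustrated with a deliberately simple Python sketch. It assumes a noise-free setting in which every student gains a fixed amount per grade and small classes add a fixed bonus, and it represents the value-added model as a plain difference in year-to-year gains. All names and numbers are hypothetical, and real value-added models are regressions rather than simple differences.

```python
def score_paths(grades, yearly_growth, class_size_gain, cumulative):
    """Return (large_class_scores, small_class_scores) for grades 1..grades.

    Hypothetical setup: every student grows by yearly_growth per grade.
    Small classes add class_size_gain, either once (constant gap) or
    again in every grade (cumulative, widening gap).
    """
    large, small = [], []
    for g in range(1, grades + 1):
        base = g * yearly_growth
        bonus = class_size_gain * (g if cumulative else 1)
        large.append(base)
        small.append(base + bonus)
    return large, small


def cross_sectional_effects(large, small):
    """Score difference in each grade, with no control for prior scores."""
    return [s - l for l, s in zip(large, small)]


def value_added_effects(large, small, start=0):
    """Difference in year-to-year gains (controls for last year's score)."""
    effects = []
    prev_l = prev_s = start
    for l, s in zip(large, small):
        effects.append((s - prev_s) - (l - prev_l))
        prev_l, prev_s = l, s
    return effects


# Constant gap: the gain difference appears only in the first year.
large, small = score_paths(4, 10, 3, cumulative=False)
constant_cs = cross_sectional_effects(large, small)  # [3, 3, 3, 3]
constant_va = value_added_effects(large, small)      # [3, 0, 0, 0]

# Widening gap: the gain difference appears in every year.
large, small = score_paths(4, 10, 3, cumulative=True)
widening_cs = cross_sectional_effects(large, small)  # [3, 6, 9, 12]
widening_va = value_added_effects(large, small)      # [3, 3, 3, 3]
```

Under the constant-gap assumption the gain-based estimate is zero after the first year even though small classes score higher in every grade, which is the pattern described above.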

The processes discussed above may or may not be the actual ones that 
exist in classes to produce class size effects. They simply point to the need to 
develop theories of the mechanisms underlying class size effects and to collect 
the data to test different theories. While a limited number of existing data sets 
might be able to start this process, it is difficult to see how definitive results are 
possible without more experimentation with more robust data collection on a 
much wider set of variables. Only by sorting this out can we be confident that 
models are specified correctly, estimation techniques are appropriate, and in- 
terpretations are accurate. 

Response: Guidance for Future Directions 
in Improving the Use of NAEP Data 



Sylvia T. Johnson 1 
Howard University 



The issues of how to meaningfully use state National Assessment of Edu- 
cational Progress (NAEP) data to assess and improve student achievement are 
the foci of the Grissmer and Flanagan and Raudenbush papers. They are excit- 
ing in their ramifications for future research and policy directions. The following 
discussion briefly describes NAEP and the whole idea of state-level data, then 
proceeds to review these papers in the context of their value in providing strat- 
egies for making these data more useful and informative assessments of national 
educational progress. 

The assessment — as well as the improvement — of student achievement 
has long been a focus of educational policy at the state and the national levels 
and in the front lines of local school districts. With different emphases at dif- 
ferent points in time, NAEP was originally designed as “the nation’s report 
card” to provide information on student achievement in subjects widely taught 
in public schools, for the nation as a whole and for specific demographic and 
geographic subgroups. However, in the first iterations of NAEP back in the 
1970s, the regional subgroups were large, each including several states. It was 
not until the introduction of the Trial State Assessment (TSA) in 1990 that a 
sampling and administration structure was developed which allowed for the 
direct comparison of states with one another. Such between-states 
comparisons were not the explicit intent of the program. Rather, the TSA was intended 
to allow each state to compare its performance with that of the nation as a 
whole or perhaps with similar states in its own geographic region. 



1 The author is Professor, Research Methodology and Statistics at the School of Education, 
Howard University, and a principal investigator for the Center for Research on the 
Education of Students Placed at Risk (CRESPAR), an OERI-funded research center. She
may be reached at Howard University, 2900 Van Ness Street NW, 116 Holy Cross Hall,
Washington, DC 20008.
Actually, the involvement of states in NAEP is not new. An association of
state educators, along with a core working group of the nation’s top psychometrics
scholars, was involved in the original conception and planning of
NAEP. 2 They wanted to implement a national assessment program to docu- 
ment student progress in a manner which would not pose a threat to 
lower-performing district participation. To help ensure a low-key, relatively 
nonintrusive assessment program, NAEP results reported the percent of stu- 
dents who correctly answered each “exercise,” as the assessment items were 
termed. Results were reported for geographic areas, and national samples of 
students were identified at ages 9, 13, and 17. NAEP currently assesses samples
of students in grades 4, 8, and 11, but also reports trends for both age and grade
for cross-sectional samples from about 1970 to the present. The initial trend 
year varies according to the time at which trend samples were introduced in 
each subject matter area (Campbell, Voelkl, and Donahue 1997). 

In fact, the actual implementation of the Trial State Assessment has had a 
marked effect on how we measure student achievement. First, a motivational 
effect seems apparent in the TSA scores: they are a bit higher than regular 
NAEP scores. Second, certain states were anxious about their comparative stand- 
ings; therefore, the “multiple comparison charts” show which unadjusted state 
mean differences were statistically significant, as well as
which differences were in the range of what would be expected simply due to 
chance. In these tables, no adjustments were made for student and family char- 
acteristics or for school resources and teacher background differences, although 
the importance of these factors certainly had been demonstrated in research 
studies carried out by NAEP, as well as in analyses of NAEP data in the litera- 
ture. The Raudenbush and the Grissmer and Flanagan papers both addressed 
this problem of more meaningful use of state TSA data to assess student achieve- 
ment, and both papers focus on the Trial State Assessment of NAEP. The data 
are based on eighth grade mathematics proficiency estimates from the Trial 
State Assessment. 

Response to the Raudenbush Paper 

In his paper, Raudenbush proposed a synthesis of state results by devel- 
oping models that include correlates of student proficiency within and between 
states. He began with a “Conceptual Model for State-level Policy Effects on 
Student Achievement” (figure 1), which shows state government acting on student
achievement primarily through its effects on school resources and home
backgrounds. His analysis thus began by using NAEP data to assess the contribution
of school resources (e.g., school and teacher quality) and students’ home
background to student achievement. The analysis was done for each of the 40 
participating states, thus providing within-state home and school correlates of 
mathematics proficiency for eighth graders. He found a substantial correlation 
between socially disadvantaged or ethnic minority status, parental education, 
and access to the key resources available for learning, specifically course-tak- 
ing opportunities, positive school climate, qualified teachers, and cognitively 
stimulating classrooms. He noted that these findings are similar to other find- 
ings in the literature. Carrying this analysis to another level, Raudenbush found 
considerable variation in the patterns of these correlations across the 41 states 
participating in TSA. These findings provided estimates for the direct effects 
of schools and teachers on student achievement while controlling for the cor- 
relation between self-reported student background, school factors, and teacher 
practices. 



Raudenbush’s analytic approach offers far more useful information to 
states than the conventional means from the NAEP TSA. By comparing states 
on resources and educational opportunities, this work enables the examination 
of possible changes in policy that are likely to positively influence student 
achievement. This analysis utilizing hierarchical linear models demonstrates 
that only a small amount of residual variance exists between states that is not 
related to school resources and family background. It should be noted here that 
there is a wide range in the proportion of within-school variability within states, 
which Raudenbush points out is also apparent across states. But the within- 
state variation is worth the attention of individual states; and there is some 
work in this area, for example, William Cooley’s paper on Pennsylvania 
(Beckford and Cooley 1993), which examines schools with sizable numbers of 
African American students in which these students score at or above the state 
mean on achievement measures, and which also cites other relevant investiga- 
tions into these questions. 



Given the demonstrated importance of school resources to achievement, 
the second part of the Raudenbush paper presented an examination of state 
differences in school resources. This work explores two questions: “Does the 
distribution of school resources likely reinforce or counteract inequalities arising
from home environment? Do states differ, not only in the provision of
resources, but also in the equity with which they are distributed?” This examination
thus focused not just on resource differences between states, but also on
how equitably resources are distributed within states. These within-state re- 
source differences were then examined, not only in terms of student 
demographic background, but also in the interaction between resource differ- 
ences, race-ethnicity, and parental education. Raudenbush’s analysis and his 
illustrative plots (figures 4, 5, 6, and 7) show that the probability of access to 
key resources is a function of these background factors. Further, he found siz- 
able differences attributable to student background, education, and ethnicity; 
and these differences were associated with access to resources related to school 
success. For example, factors such as teacher quality and experience, school 
climate, whether or not a school offered algebra, and whether students were 
assigned to math teachers who majored in math and who emphasized reason- 
ing in their classroom instruction — these were related to success, as well as to 
the variables used as predictors in Phase I of the Raudenbush work. These 
same variables, when examined across states, result in the ellipses (figures 8 
and 9) that visually show the relative access to resources provided to African 
American students. For example, figure 8 shows that in South Carolina and 
Mississippi, having parents who are college educated offers only modest ad- 
vantage to students, and the advantage is about the same for African Americans 
and youth of other backgrounds. 

The meaning and the utility of Raudenbush’s findings and this methodol- 
ogy for states and districts, and perhaps also for schools, are substantial. First, 
the absence of large statistical differences between adjusted means should in
no way encourage states to conclude that little needs to be done. The kind of action that would
be prompted from the states was a concern of the National Assessment Gov- 
erning Board when it decided, when plans were made for reporting the 1990 
results, to report unadjusted means only. Second, the Raudenbush analyses 
clearly show the importance of access to school resources for student achieve- 
ment. In order to move toward equity for all students, these findings demonstrate 
unequivocally that resource accessibility is a key factor. 

In terms of the current “affirmative action” debate, these findings also 
have important implications for many states, such as Florida and California. 
First, can a state logically expect proportional representation by ethnicity on 
college entry characteristics such as test scores and course-taking patterns when 
it has systematically limited access to resources available to students? Given 
the demonstrated relation of resources to measured achievement reported in
the Raudenbush paper, it would seem only logical that a state, in making college
entrance decisions, should take into consideration the relative access it
provides to certain resources in elementary and secondary school. Such a pro- 
cedure would need, certainly, a well-specified model and additional research. 

Though the many issues are still in flux, reactions to affirmative action 
are often far too simplistic in their conceptions of how the specific mecha- 
nisms work in practice. Where race has served as a factor in the allocation of 
skilled teachers and other education resources, whether by design or not, this 
condition raises the question of how this method of allocation has to be consid- 
ered when the measurable results of such allocations are evaluated. 

The author’s broad recommendations should be noted here. If student 
progress and subject matter proficiency are examined on a year-to-year basis, 
the extent to which the educational system (or even the school) differentially 
provides resources that support student progress should also be examined. This 
is the opportunity-to-learn concept. The soundly based but creative methodol- 
ogy that is employed in the Raudenbush work offers strong promise for helping 
us better understand how student background and resource access are interre- 
lated, as well as when such access factors have been modified. The latter could 
be examined by extending the analysis over time so that the progress of states 
in modifying resource access could be followed and appraised. The author 
suggested extending the collection of data by NAEP to measures of resources 
and development indicators from these data. Such a direction seems logical 
and important, but it stands in opposition to current plans to release results 
more quickly and to collect fewer data from students, teachers, and schools 
than has been done in the past. 

How can states and schools use these findings? To begin with, they can 
collect comparable data at the district and school levels so that the internal 
allocation of resources is more completely documented. They can also modify 
teacher assignments to provide more equitable distribution of highly skilled 
teachers, although such a goal may entail the need for financial incentives, 
along with improving the facilities and working conditions for teachers in some 
schools, and other strategies. For teachers, the range in access to everything — 
from professional development to clean bathrooms — is very great across 
schools, even in the same or nearby districts. State officials may lobby for 
increased state support to remedy access problems: they could target selected 
schools and districts, using these findings as a basis for the request. They may
investigate how parental education operates to increase access to resources
and provide help to parent groups in lower-access schools to develop strategies 
to bring about change for the better in their schools. They can recognize the 
broad scope of the documented inequities and develop a broad-based strategy, 
sustained over the long term, to simultaneously and progressively change the 
inequities in resource allocation that are so widely spread among states, sys- 
tems, and schools. 

Response to the Grissmer and Flanagan Paper 

In their paper, “Moving Educational Research toward Scientific Consen- 
sus,” Grissmer and Flanagan assert the need to improve the consistency and 
accuracy of results in educational research so that a basic knowledge base in 
education can be built, one that can be accepted by a diverse research commu- 
nity as well as by educational practitioners and policymakers. The authors assert 
that improving nonexperimental data, along with the associated methodology, 
may not be enough to achieve consensus. Rather, they contend that experi- 
ments are needed and, further, that experiments should often employ models 
such that the size of a given year’s effect can be viewed as dependent on the 
current year’s and previous years’ effects. These findings should then be used 
to build micro-theories of educational process. 

Grissmer and Flanagan dealt with the issue of the relation between school 
resources and school achievement in two major examples. The first example 
which they cite is the rapid change in NAEP scores of black students, espe- 
cially from 1970 to 1988 or 1990. In the case of black student progress in 
NAEP scores, Grissmer and Flanagan gave credit to compensatory and devel- 
opmental programs and school desegregation activities, but they did not offer 
conjecture regarding the score declines that occurred starting from 1988 or 1990.

The Tennessee class size study is well presented by Grissmer and 
Flanagan; and the posing of a simple process model for these effects is a useful 
and important addition to the literature. The detail presented to amplify and 
explain how teacher reactions to the scarcity or abundance of class time may 
interact with student characteristics is thoughtful. Readers are encouraged to 
take the time to examine that part of the paper carefully, as it provides further 
support for the need of extensive data collection, either from teacher question- 
naires, interviews, or classroom observations. 

It is good to see the Tennessee class size experiment get the kind of attention
toward implementation that it has deserved for some years. This long-term
experiment received relatively little attention in the policy arena until recently, 
but now seems to be getting wide notice. Certainly, the Tennessee class size 
study has reached the ear of the President — it is, indeed, a factor in the call for 
many new teachers across the nation. Interestingly, the Tennessee study results 
from collaboration between the historically black university, Tennessee State 
University, and the State Department of Education, with important guidance 
from a participant in this conference, Jeremy Finn. This well-designed study 
made it possible to study the effects of smaller classes over time. Thus, it is an 
excellent model for the kind of work that needs to be done to develop and test 
theories in education. 

In addition to their suggestions for extending and improving NAEP data 
collections, Grissmer and Flanagan suggest improving NAEP by collecting 
family characteristics at the school level, possibly using Census Bureau data to 
augment NAEP. There are some states and other jurisdictions which have used 
Census data and other federal reporting information to improve their estimates 
for allocation of social and economic services. A major problem in this work 
has been the adequacy of geo-coding (the coding of addresses and other loca- 
tion information) in the files proposed for use to improve estimation. The 
problem is especially severe in sparsely populated areas; namely, rural com- 
munities, older urban industrial zones that have lost population with the closing 
of plants, and small towns. A National Research Council panel has been exam- 
ining problems of estimation of poverty in small geographic areas, and the 
U.S. Department of Education, through the National Center for Education Sta- 
tistics, is a sponsor of this work. Working in close collaboration with the Census 
Bureau, the panel has been operating for about 3 years, has published three 
interim reports, and is working to complete a final report (National Research 
Council 1999). Their findings should be useful in the improvement of param- 
eter estimation for many forms of resource allocation. 

Grissmer and Flanagan also suggested a school district sample rather than 
a school sample for NAEP and the use of a longitudinal cohort. A district sample, 
though more expensive to collect, might enable comparisons of scores for ur- 
ban and suburban districts within metropolitan areas of similar size, though 
this level of reporting is disallowed under current NAEP authorization. Now, 
of course, prior to TSA, NAEP was a low-stakes testing program. Since comparisons
are now easily made between states, it is worth considering whether
we might facilitate between-district comparisons, if solid methodology that 
gives consideration to resource allocation is used to interpret those differences. 
Such a change would raise the stakes for State NAEP as well as for national 
NAEP. Given the effect of other high-stakes testing programs and the funda- 
mental role of NAEP as the nation’s report card, such changes should be 
carefully reviewed. 

The authors support the use of longitudinal cohorts. A longitudinal study 
would involve identifying students or at least forming blocks at the school or 
district level, but the advantages would need to be weighed along with costs. 
NAEP currently examines trends by retaining a common core of test items 
which are administered to cross-sectional grade level groups. 

Conclusion 

Now let us consider what these papers, taken together, tell us. Both stud- 
ies point out the importance of a school’s climate and culture to discipline. In 
our work at CRESPAR, the Center for Research on the Education of Students 
Placed at Risk, an OERI-funded research center located at Howard University 
and Johns Hopkins University, we are guided by a talent development model 
recently articulated in an article by my colleague, Serge Madhere (1998; see 
also Boykin 1996). This model has a number of points, the most important one 
of which is that all children can learn, given adequate opportunity and that 
their backgrounds and culture have strengths that can be built on to motivate 
and encourage student learning. Learning is more a function of coherent in- 
struction than of a child’s social origin. Motivation begets greater learning, 
which begets greater motivation. Nurturing is the key to motivation, especially 
at difficult transition points. 

Both of these papers offer creative methodological approaches to the use 
of state-level data for school improvement and for theory building. Both dem- 
onstrate the importance of resource allocation for student achievement and the 
interaction between race and resources, and both imply procedures for increasing 
proficiency among African American students. Both imply the need for more 
complex NAEP data at the level of the student, the family, the teacher, the 
school, and the state. This emphasis, however, runs counter to the current push 
to simplify and speed up the data collection and reporting process. More complex
data require more complex consideration before developing conclusions.
Both studies show the need to better understand the why’s and how’s of improving
student achievement. For example, how do teachers use time? When
they have more time per student, what are the features of resources that make 
them effective in influencing achievement? 

These are important directions for researchers and policymakers to con- 
sider. Fortunately, there is much in these papers to help guide that progress. 
Understanding Ethnic Differences in
Academic Achievement: Empirical Lessons
from National Data

Meredith Phillips 

University of California, Los Angeles 

In 1966, James Coleman published results from the first national study to 
describe ethnic differences in academic achievement among children of vari- 
ous ages. Since that time, we have made considerable progress in survey design, 
cognitive assessment, and data analysis. Yet we have not made much progress 
in understanding when ethnic differences in academic achievement arise, how 
these differences change with age, or why such changes occur. 1 The purpose 
of this paper is to highlight several reasons why we have learned so little about 
these important issues over the past few decades. I begin by reviewing recent 
research on how the test score gap between African Americans and European 
Americans changes as children age. I then discuss several conceptual and meth- 
odological issues that have hindered our understanding of ethnic differences in 
academic achievement. I raise these issues in the hope that we will make more 
progress toward eliminating the test score gap during the next decade than we 
have during the last. 2 



1 I use the term “ethnic” to refer to the major ethnic and racial groups in the United States 
(namely, African Americans, European Americans, Latinos, Asian Americans, and Native 
Americans). Whenever the samples are large enough, I also consider variation within 
these socially constructed categories (for example, differences between Mexican 
Americans and Puerto Rican Americans). 

2 I thank Robert Hauser, Larry Hedges, Christopher Jencks, Jeff Owings, and Michael Ross 
for their comments on an earlier draft. I did not make all the changes they suggested, 
however; and they are in no way responsible for my conclusions. Please direct all 
correspondence to Meredith Phillips, School of Public Policy and Social Research, 

UCLA, 3250 Public Policy Building, Los Angeles, CA 90095-1656 or
hillips@sppsr.ucla.edu.
Does the Achievement Gap Change as Children Age?

My colleagues and I recently analyzed data from a number of national 
surveys in order to estimate how the achievement gap changes as children age 
(see Phillips, Crouse, and Ralph 1998). Answering this question can help us 
understand the potential causes of the gap. Suppose, for example, that the black- 
white gap did not widen at all after first grade, even among black and white 
children who began school with similar skills. If that were the case, we might 
conclude that families, communities, preschools, or kindergartens were mainly 
responsible for the gap. On the other hand, suppose that the black-white gap 
did widen between the first and the twelfth grades, even among children who 
started school with similar scores. If that were the case, we might conclude 
that schools were mainly responsible for the gap. As it turns out, the “truth” 
seems to fall somewhere between these extremes. 

Cross-sectional Results 

One way to describe age-related changes in the black-white gap is to 
estimate the size of the gap in as many surveys as possible and then combine 
these estimates. We have done this with the national surveys listed in table 1. 
Figure 1a arrays the black-white math gaps from these surveys by age. The
lines around the estimates show their precision. We can also array these gaps
by year of birth, which shows the historical trend in the black-white math gap
(see figure 1b). Because the black-white gap narrowed during the 1970s and
1980s, however, we need to make sure that age-related changes in the gap are 
not confounded with historical changes. In order to disentangle the effects of 
age from the effects of history, we estimated a multivariate model that con- 
trolled for the historical trend while estimating the age-related trend. 3 Table 2 
presents these results. It shows the following: the black-white math gap widens
by about 0.18 standard deviations between the first and the twelfth grades;
the reading gap stays relatively constant; the vocabulary gap widens by about
0.23 standard deviations. 4 A gap of one standard deviation on the math or
verbal SAT is 100 points. Therefore, our cross-sectional results imply that the
black-white math and vocabulary gaps widen by the equivalent of just under 2



3 For details on the sample and analysis, see Phillips, Crouse, and Ralph (1998). 

4 To obtain these estimates, multiply the coefficients in the first row of table 2 by 12 years
of school.
Table 1. Data Sets Used in Meta-analysis

Acronym    Name                                         Test Year(s)      Grades Tested
EEO        Equality of Educational Opportunity Study    1965              1, 3, 6, 9, 12
NLSY       National Longitudinal Survey of Youth        1980              10, 11, 12
HS&B       High School & Beyond                         1980              10, 12
LSAY       Longitudinal Study of American Youth         1987              7, 10
CNLSY      Children of the National Longitudinal        1992              Preschool, K, 1, 2, 3, 4, 5
           Survey of Youth
NELS       National Education Longitudinal Study        1988, 1990, 1992  8, 10, 12
PROSPECTS  Prospects: The Congressionally-Mandated      1991              1, 3, 7
           Study of Educational Growth and Opportunity
NAEP       National Assessment of Educational Progress  1971-1996         4, 8, 11

Figure 1a. Standardized Black-White Math Gaps, by Grade Level

Figure 1b. Standardized Black-White Math Gaps, by Year of Birth




Table 2. Effects of Grade at Testing and Year of Birth on Black-White Test Score Gaps

                               Dependent Variables
                               Mathematics (N=45)  Reading (N=45)      Vocabulary (N=20)
Independent Variables          1         2         1         2         1         2

Grade level                B   .015                .002                .019
                           SE  (.004)              (.006)              (.006)
Grades 1-6                 B             .051                -.011               .034
                           SE            (.014)              (.023)              (.012)
Grades 7-8                 B             -.054*              .016                .025
                           SE            (.028)              (.051)              (.032)
Grades 9-12                B             .021*               .010                -.018
                           SE            (.013)              (.024)              (.017)
Month of testing           B   -.011     -.007     .003      .000      .015      .011
                           SE  (.004)    (.004)    (.005)    (.007)    (.018)    (.018)
Year of birth before 1978  B   -.014     -.014     -.020     -.020     -.010     -.011
                           SE  (.002)    (.002)    (.002)    (.002)    (.003)    (.003)
Year of birth after 1978   B   .002*     .004*     .020*     .018*     .031*     .039*
                           SE  (.006)    (.005)    (.009)    (.010)    (.011)    (.012)
Longitudinal survey        B   -.039     -.043     -.069     -.063     -.346     -.273
                           SE  (.033)    (.033)    (.047)    (.051)    (.157)    (.161)
IRT metric                 B   .175      .149      .159      .174      .068      .000
                           SE  (.033)    (.035)    (.046)    (.051)    (.082)    (.088)
Intercept                  B   .765      .653      .746      .792      .889      .833
                           SE  (.034)    (.054)    (.056)    (.092)    (.049)    (.057)
Adjusted R²                    .790      .815      .693      .680      .745      .806

NOTE: The dependent variables are standardized black-white gaps (i.e., (W-B)/SD_T) computed
from the surveys listed in table 1. The actual data appear in table 7A-1 in Phillips, Crouse, and
Ralph (1998). Standard errors are in parentheses. The spline coefficients for grade level and
year of birth show the actual slope for that spline. The spline standard error indicates whether
the slope differs from zero. * indicates that the spline’s slope differs significantly from a linear
slope at the .05 level. Each gap is weighted by the inverse of its estimated sampling variance.
See Phillips, Crouse, and Ralph (1998) for details on the other variables in this analysis. See
pp. 118-19 of Pindyck and Rubinfeld (1991) for an introduction to spline (piecewise linear)
models. See Cooper and Hedges (1994) for details on the meta-analytic methods used in this
analysis.



SAT points a year, or by 18 to 23 SAT points over the course of elementary, 
middle, and high school. 
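
The arithmetic behind these figures can be checked directly. The following is a minimal sketch in Python, not part of the original analysis. It takes the grade-level coefficients from the first row of table 2, multiplies them by 12 years of school as footnote 4 describes, and converts the result to SAT points at 100 points per standard deviation. The function and variable names are illustrative assumptions.

```python
# Per-grade slopes of the standardized black-white gap, copied from the
# "Grade level" row of table 2 (model 1 for each test).
GRADE_LEVEL_SLOPES = {"math": 0.015, "reading": 0.002, "vocabulary": 0.019}

YEARS_OF_SCHOOL = 12       # multiplier given in footnote 4
SAT_POINTS_PER_SD = 100    # one standard deviation on the math or verbal SAT


def cumulative_widening(slope_per_grade, years=YEARS_OF_SCHOOL):
    """Total change in the gap, in standard deviations, over `years` grades."""
    return slope_per_grade * years


def sat_equivalent(gap_in_sd, points_per_sd=SAT_POINTS_PER_SD):
    """Express a gap measured in standard deviations as SAT points."""
    return gap_in_sd * points_per_sd


for test, slope in GRADE_LEVEL_SLOPES.items():
    total_sd = cumulative_widening(slope)
    print(
        f"{test}: {total_sd:.2f} SD over {YEARS_OF_SCHOOL} years, "
        f"about {sat_equivalent(total_sd):.0f} SAT points "
        f"({sat_equivalent(slope):.1f} per year)"
    )
```

For mathematics the calculation is 0.015 x 12 = 0.18 standard deviations, or about 18 SAT points; for vocabulary it is 0.019 x 12, roughly 0.23 standard deviations, or about 23 SAT points. Both correspond to just under 2 SAT points a year.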

These cross-sectional estimates have two advantages over longitudinal 
estimates. First, the data span nearly all grade levels, from early elementary 
school through late high school. No national longitudinal survey has ever tested 
children over an interval spanning both elementary school and high school. 
Second, because cross-sectional surveys do not follow students over time, they 
are less subject to attrition and thus tend to be more nationally representative 
than longitudinal surveys. A problem with our cross-sectional results, how- 
ever, is that they combine data on children from different samples, who were 
assessed on different, possibly incomparable, tests. Another problem is that 
cross-sectional data cannot tell us whether the black-white gap widens among 
children who start school with the same skills. That question, which is central 
to the concern that schools may not be offering black and white students equal 
educational opportunities, can be answered only with longitudinal data. 
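
The note to table 2 describes a spline (piecewise linear) regression in which each gap estimate is weighted by the inverse of its sampling variance. The sketch below illustrates that general technique in Python using only the standard library. It is not the code or the data used by Phillips, Crouse, and Ralph (1998): the gap values and sampling variances are made up, the model includes only the grade-level splines (omitting month of testing, year of birth, and the survey indicators), and all function names are hypothetical.

```python
def spline_terms(grade, knots=(6, 8), top=12):
    """Split a grade into years spent in grades 1-6, 7-8, and 9-12.

    With this coding, each regression coefficient is the slope of the gap
    within its own segment, as the note to table 2 describes.
    """
    edges = (0,) + tuple(knots) + (top,)
    return [min(max(grade - lo, 0), hi - lo) for lo, hi in zip(edges, edges[1:])]


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def weighted_least_squares(rows, y, variances):
    """Fit y = X beta, weighting each observation by 1 / sampling variance."""
    w = [1.0 / v for v in variances]
    k = len(rows[0])
    xtwx = [[sum(wi * r[i] * r[j] for wi, r in zip(w, rows)) for j in range(k)]
            for i in range(k)]
    xtwy = [sum(wi * r[i] * yi for wi, r, yi in zip(w, rows, y)) for i in range(k)]
    return solve(xtwx, xtwy)


# Hypothetical inputs, NOT the values behind table 2: noiseless gaps
# generated from known slopes so that the fit can be checked.
TRUE_BETA = [0.65, 0.05, -0.05, 0.02]  # intercept, slopes for 1-6, 7-8, 9-12
grades = list(range(1, 13))
variances = [0.002 + 0.0005 * (g % 3) for g in grades]  # made-up variances
design = [[1.0] + spline_terms(g) for g in grades]
gaps = [sum(b * x for b, x in zip(TRUE_BETA, row)) for row in design]

beta = weighted_least_squares(design, gaps, variances)
print([round(b, 3) for b in beta])
```

Because the made-up gaps contain no noise, the fit returns the slopes used to generate them, which confirms that with this coding each coefficient is the slope within its own grade range. With real gap estimates the weights matter: surveys with small sampling variances pull the fitted line toward their estimates more strongly than imprecise ones.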
Longitudinal Results

During the late 1980s and early 1990s, two national longitudinal surveys 
assessed students multiple times as they moved through school. The National 
Education Longitudinal Study (NELS) is the more familiar of these studies.
NELS is a large national survey that first tested eighth graders in 1988 and 
then retested them in 1990 and 1992. Prospects, a survey of two cohorts of 
elementary school students and one cohort of middle school students that be- 
gan in 1991, is less familiar than NELS because it is not yet readily available to 
researchers. The Prospects data were collected mainly to evaluate the effec- 
tiveness of Chapter 1 (now Title I), but their secondary purpose was to describe 
yearly achievement growth during elementary and middle school. The young- 
est of the Prospects cohorts was first tested at the beginning of first grade and 
followed through the end of third grade. The middle Prospects cohort was first 
tested at the end of the third grade and followed through the end of sixth grade. 
The oldest cohort was tested at the end of seventh grade and followed through 
the end of ninth grade. 

In order to understand achievement growth over an interval longer than 
four years, we have to piece together data from these different cohorts. My col- 
leagues and I have used these data to estimate whether black children who start 
out with the same skills as whites learn less over the school years. 5 Our estimates 
are very imprecise because the Prospects sampling design was relatively ineffi- 
cient and because we do not have data for every school year. 6 Nonetheless, our 
results suggest that African American children fall somewhat behind equally 
skilled white children, particularly in reading comprehension, and particularly 
during the elementary school years (see figure 2). 7 Taken together, we estimate 
that at least half of the black-white gap that exists at the end of twelfth grade can 



5 See Phillips, Crouse, and Ralph (1998) for details. 

6 Also, a very large percentage of the Prospects students left the study before the second 
and third waves. When Phillips, Crouse, and Ralph (1998) compared cross-sectional and 
longitudinal samples drawn from Prospects, however, they found that the mean black- 
white gap differed by less than 0.05 standard deviations across all tests. And although the 
longitudinal samples were more advantaged than the cross-sectional samples, racial 
differences in attrition were small and mostly involved regions of residence and urban- 
ism. See chapter 3 of Phillips (1998) for more on nonrandom attrition in both Prospects 
and NELS. 






7 See Phillips, Crouse, and Ralph (1998) for a comparable figure using cross-sectional 
data, as well as for a figure that shows the imprecision of these predictions. 

Figure 2. Predicted Test Scores for Two Students, One Black, 
One White, Who Both Started First Grade with True Math, Reading, 
and Vocabulary Scores at the Mean of the Population Distribution 

[Figure not reproduced. Legend: White Student's Math, Reading, and Vocabulary Scores; 
Black Student's Math Score; Black Student's Reading Score; Black Student's Vocabulary Score] 



be attributed to the gap that already existed at the beginning of first grade. The 
remainder of the gap seems to emerge during the school years. 

This widening of the gap may not be attributable to schooling per se, 
however. Because of summer vacation, students spend only 180 days a year in 
school. Because neither Prospects nor NELS tested children in the fall and the 
spring of each school year, it is impossible to know how much of the gap that 
emerges over the course of schooling should be attributed to schools and how 
much should be attributed to summer vacations. 8 
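
The measurement problem described above can be illustrated with a minimal sketch. It assumes a hypothetical study that measures the gap (in standard deviations) in both the fall and the spring of each school year, and it splits the total growth in the gap into a school-year part and a summer part. The function name and all numbers are illustrative assumptions, not estimates from Prospects or NELS.

```python
from typing import Sequence, Tuple


def decompose_gap_growth(
    fall_gaps: Sequence[float], spring_gaps: Sequence[float]
) -> Tuple[float, float]:
    """Split growth in a test score gap into school-year and summer parts.

    fall_gaps[i] and spring_gaps[i] are the gap, in standard deviations,
    measured at the start and at the end of school year i.
    """
    if len(fall_gaps) != len(spring_gaps):
        raise ValueError("need one fall and one spring measurement per year")
    # Change while school is in session: spring minus fall of the same year.
    school_year = sum(s - f for f, s in zip(fall_gaps, spring_gaps))
    # Change over each summer: the next fall minus the preceding spring.
    summer = sum(f_next - s for s, f_next in zip(spring_gaps, fall_gaps[1:]))
    return school_year, summer


# Made-up gaps for three consecutive school years (hypothetical values only).
fall = [0.50, 0.60, 0.70]
spring = [0.52, 0.63, 0.71]
school_year_part, summer_part = decompose_gap_growth(fall, spring)
print(f"school-year change: {school_year_part:.2f} SD")
print(f"summer change:      {summer_part:.2f} SD")
```

With testing only once a year, a study observes just the sum of the two parts (here, the last spring gap minus the first fall gap), so the split between schooling and summer vacation cannot be recovered.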

In an ideal world, we would know precisely when ethnic differences 
in test scores first emerge and how they develop during the preschool years. 



8 Several other studies have examined summer learning patterns (e.g., Cooper et al. 1996; 
Entwisle and Alexander 1992, 1994; and Heyns 1978, 1987). Further, Prospects tested 
an unrepresentative subsample of students in the fall and spring of first and second grade. 
I review these results later in the paper. 
We would also know how ethnic differences change both every school year 
and every summer. This information would help us identify the most important 
reasons why African American and Latino children score lower than whites 
and Asian Americans on math and reading achievement tests. Unfortunately, 
we are not close to knowing the answers to these seemingly basic, descriptive 
questions. In the remainder of this paper, I discuss several explanations for this 
knowledge gap. 



The most obvious reason why we have made so little progress on the test 
score gap puzzle since 1966 is that most researchers have been reluctant to 
study it. Rather than directly tackling this politically sensitive subject, most 
scholars have tried to understand ethnic inequalities in academic skills by com- 
paring socioeconomically disadvantaged students to advantaged students, by 
comparing students in high poverty schools to those in low poverty schools, or 
by comparing urban students to suburban students. All these comparisons pose 
interesting questions for social science. None, however, brings us closer to 
understanding ethnic differences in academic achievement because ethnicity 
does not overlap with social class and urbanism as much as most researchers 
assume. 

Table 3 illustrates this problem. It shows the magnitude of the black- 
white test score gap among a national sample of eighth graders, according to 
the education and income levels of their parents, as well as the poverty and 
urbanism of their schools. If these other variables were adequate substitutes 
for race, the black-white gap would disappear after these variables were taken 
into account. The black-white gap does shrink, but it is still large within each 
of these categories. More sophisticated analyses that simultaneously control 
many family background variables yield similar results (see Phillips et al. 1998). 9 
Racial and ethnic differences in test scores are not the same as SES differ- 



9 See also appendix C of Phillips, Crouse, and Ralph (1998) for data on how much the 
black-white gaps in Prospects and NELS shrink after controlling a number of common 